Abstract



We consider the problem of designing an overlay network and routing mechanism that permits
finding resources efficiently in a peer-to-peer system. We argue that many existing approaches
to this problem can be modeled as the construction of a random graph embedded in a
metric space whose points represent resource identifiers, where the probability of a connection
between two nodes depends only on the distance between them in the metric space. We study
the performance of a peer-to-peer system where nodes are embedded at grid points in a simple
metric space: a one-dimensional real line. We prove upper and lower bounds on the message
complexity of locating particular resources in such a system, under a variety of assumptions
about failures of either nodes or the connections between them. Our lower bounds in particular
show that the use of inverse power-law distributions in routing, as suggested by Kleinberg [3], is
close to optimal. We also give efficient heuristics to dynamically maintain such a system as new
nodes arrive and old nodes depart. Finally, we give experimental results that suggest promising
directions for future work.

1 Introduction

Peer-to-peer systems are distributed systems without any central authority and with varying
computational power at each machine. We study the problem of locating resources in such a large
network of heterogeneous machines that are subject to crash failures. We describe how to
construct distributed data structures that have certain desirable properties and allow efficient resource
location.

Decentralization is a critical feature of such a system as any central server not only provides a
vulnerable point of failure but also wastes the power of the clients. Equally important is scalability:
the cost borne by each node must not depend too much on the network size and should ideally
be proportional, within polylogarithmic factors, to the amount of data the node seeks or provides.
Since we expect nodes to arrive and depart at a high rate, the system should be resilient to both
link and node failures. Furthermore, disruptions to parts of the data structure should self-heal to
provide self-stabilization.

*This is an extended version of the paper appearing in the proceedings of the Twenty-First ACM Symposium on
Principles of Distributed Computing, 2002.

Our approach provides a hash table-like functionality, based on keys that uniquely identify the 
system resources. To accomplish this, we map resources to points in a metric space either directly 
from their keys or from the keys' hash values. This mapping dictates an assignment of nodes 
to metric-space points. We construct and maintain a random graph linking these points and use 
greedy routing to traverse its edges to find data items. The principle we rely on is that failures leave 
behind yet another (smaller) random graph, ensuring that the system is robust even in the face of 
considerable damage. Another compelling advantage of random graphs is that they eliminate the 
need for global coordination. Thus, we get a fully-distributed, egalitarian, scalable system with no 
bottlenecks. 

We measure performance in terms of the number of messages sent by the system for a search 
or an insert operation. The self-repair mechanism may generate additional traffic, but we expect 
to amortize these costs over the search and insert operations. Given the growing storage capacity 
of machines, we are less concerned with minimizing the storage at each node; but in any case the 
space requirements are small. The information stored at a node consists only of a network address 
for each neighbor. 

The rest of the paper is organized as follows. Section 2 explains our abstract model in detail,
and Section 3 describes some existing peer-to-peer systems. We prove our results for routing in
Section 4. In Section 5 we present a heuristic method for constructing the random graph and
provide experimental results that show its performance in practice. Section 6 describes results
of experiments we performed to test the routing performance of our constructed distributed data
structure. Conclusions and future work are discussed in Section 7.

2 Our approach 

The idea underlying our approach consists of three basic parts: (1) embed resources as points in a
metric space, (2) construct a random graph by appropriately linking these points, and (3) efficiently
locate resources by routing greedily along the edges of the graph. Let $R$ be a set of resources spread
over a large, heterogeneous network $N$. For each resource $r \in R$, $owner(r)$ denotes the node in $N$
that provides $r$ and $key(r)$ denotes the resource's key. Let $K$ be the set of all possible keys. We
assume a hash function $h : K \to V$ such that resource $r$ maps to the point $v = h(key(r))$ in a metric
space $(V, d)$, where $V$ is the point set and $d$ is the distance metric, as shown in Figure 1. The hash
function is assumed to populate the metric space evenly. Note that via this resource embedding, a
node $n$ is mapped onto the set $V_n = \{v \in V : \exists r \in R,\ v = h(key(r)) \wedge (owner(r) = n)\}$, namely
the set of metric-space points assigned to the resources the node provides.
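The embedding step can be illustrated with a minimal sketch. It assumes a one-dimensional point set $V = \{0, \ldots, 2^{16} - 1\}$ with $d(u, v) = |u - v|$ and uses SHA-256 as a stand-in for $h$; the names `SPACE_SIZE`, `h`, `distance`, and `embed`, as well as the sample keys and owners, are hypothetical and not taken from the paper.

```python
import hashlib
from collections import defaultdict

SPACE_SIZE = 2 ** 16  # hypothetical number of grid points in V


def h(key: str, space_size: int = SPACE_SIZE) -> int:
    """Map a key to a point of V = {0, ..., space_size - 1} (assumed hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % space_size


def distance(u: int, v: int) -> int:
    """The metric d on the line: absolute difference of coordinates."""
    return abs(u - v)


def embed(resources):
    """Given (key, owner) pairs, return {owner n: set of points V_n}."""
    points = defaultdict(set)
    for key, owner in resources:
        points[owner].add(h(key))
    return dict(points)


# Hypothetical resources: two provided by node "n1", one by node "n2".
resources = [("song.mp3", "n1"), ("paper.pdf", "n1"), ("movie.avi", "n2")]
V = embed(resources)
```

Any node that knows a key can recompute its point locally, which is the property the approach relies on.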

Our next step is to carefully construct a directed random graph from the points embedded in
$V$. We assume that each newly-arrived node $n$ is initially connected to some other node in $N$. Each
node $n$ generates the outgoing links for each vertex $v \in V_n$ independently. A link $(v, u) \in V_n \times V_m$
simply denotes that $n$ knows that $m$ is the network node that provides the resource mapped to
$u$; hence, we can view the graph as a virtual overlay network of information, pieces of which are
stored locally at each node. Node $n$ constructs each link by executing the search algorithm to
locate the resource that is mapped to the sink of that link. If the metric space is not populated
densely enough, the choice of a sink may result in a vertex corresponding to an absent resource.
In that case, $n$ chooses the neighbor present closest to the original sink. Moving to nearby vertices
will introduce some bias in the link distribution, but the magnitude of error does not appear to be
large. A more detailed description of the graph construction is given in Section 5.

Figure 1: An example of the metric-space embedding.

Having constructed the overlay network of information, we can now use it for resource location.
As new nodes arrive, old nodes depart, and existing ones alter the set of resources they provide or
even crash, the resources available in the distributed database change. At any time $t$, let $R^t \subseteq R$
be the set of available resources and $I^t$ be the corresponding overlay network. A request by node $n$
to locate resource $r$ at time $t$ is served in a simple, localized manner: $n$ calculates the metric-space
point $v$ that corresponds to $r$, and a request message is then routed over $I^t$ from the vertex in
$V_n$ that is closest to $v$ to $v$ itself.¹ Each node needs only local information, namely its set of neighbors
in $I^t$, to participate in the resource location. Routing is done greedily by forwarding the message
to the node mapped to a metric-space point as close to $v$ as possible. The problem of resource
location is thus translated into routing on random graphs embedded in a metric space.
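A minimal sketch of this greedy forwarding rule on a line follows. It assumes that points are integers, that `links` maps each point to the set of points it knows about, and that the message stops when no neighbor is strictly closer to the target; the function name `greedy_route` and the toy graph are illustrative assumptions, not the paper's implementation.

```python
def greedy_route(links, source, target, max_hops=10_000):
    """Forward greedily from source toward target.

    links maps each point to an iterable of neighboring points.
    Returns the list of visited points; routing stops at the target or
    when no neighbor is strictly closer to the target than the current point.
    """
    path = [source]
    current = source
    while current != target and len(path) <= max_hops:
        # Pick the neighbor whose point is closest to the target.
        best = min(links[current], key=lambda u: abs(u - target), default=current)
        if abs(best - target) >= abs(current - target):
            break  # stuck: no neighbor makes progress
        current = best
        path.append(current)
    return path


# Toy overlay on points 0..15: immediate neighbors plus two long links.
n = 16
links = {x: {y for y in (x - 1, x + 1) if 0 <= y < n} for x in range(n)}
links[0] |= {8}
links[8] |= {12}
path = greedy_route(links, 0, 13)
```

Each step uses only the neighbor set of the current point, matching the locality requirement described above.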

To a first approximation, our approach is similar to the "small-world" routing work by Klein- 
berg in which points in a two-dimensional grid are connected by links drawn from a normalized 
power-law distribution (with exponent 2), and routing is done by having each node route a packet 
to its neighbor closest to the packet's destination. Kleinberg's approach is somewhat brittle be- 
cause it assumes a constant number of links leaving each node. Getting good performance using his 
technique depends both on having a complete two-dimensional grid of nodes and on very carefully 
adjusting the exponent of the random link distribution. We are not as interested in keeping the 
degree down and accept a larger degree to get more robustness. We also cannot assume a complete 
grid: since fault-tolerance is one of our main objectives, and since nodes are mapped to points in 
the metric space based on what resources they provide, there may be missing nodes. 

The use of random graphs is partly motivated by a desire to keep the data structure scalable
and the routing algorithm as decentralized as possible, as random graphs can be constructed locally
without global coordination. Another important reason is that random graphs are by nature robust
against node failures: a node-induced subgraph of a random graph is generally still a random graph;
therefore, the disappearance of a vertex, along with all its incident links (due to failure of one of
the machines implementing the data structure) will still allow routing while the repair mechanism
is trying to heal the damage. The repair mechanism also benefits from the use of random graphs,
since most random structures require less work to maintain their much weaker invariants compared
to more organized data structures.

¹Note that since $I^t$ generally changes with time, and may specifically change while the request is being served,
the request message may be routed over a series of different overlay networks $I^{t_1}, \ldots, I^{t_k}$.

Embedding the graph in a metric space has the very important property that the only infor- 
mation needed to locate a resource is the location of its corresponding metric-space point. That 
location is permanent, both in the sense of being unaffected by disruption of the data structure, and 
easily computable by any node that seeks the resource. So, while the pattern of links between nodes 
may be damaged or destroyed by failure of nodes or of the underlying communication network, the 
metric space forms an invulnerable foundation over which to build the ephemeral parts of the data 
structure. 

3 Current peer-to-peer systems 

Most of the peer-to-peer systems in widespread use are not scalable. Napster [S] has a central 
server that services requests for shared resources even though the actual resource transfer takes 
place between the peer requesting the resource and the peer providing it, without involving the 
central authority. However, this design has several disadvantages: the server is a vulnerable single point
of failure, the computational power of the clients is wasted, and the system does not scale. Gnutella floods
the network to locate a resource. Flooding creates a trade-off between overloading every node in the 
network for each request and cutting off searches before completion. While the use of super-peers 
[7] ameliorates the problem somewhat in practice, it does not improve performance in the limit. 

Some of these first-generation systems have inspired the development of more sophisticated 
ones like CAN, Chord [5] and Tapestry. CAN partitions a $d$-dimensional metric space into
zones. Each key is mapped to a point in some zone and stored at the node that owns the zone. Each
node stores $O(d)$ information, and resource location, done by greedy routing, takes $O(dn^{1/d})$ time.
Chord maps nodes to identities of $m$ bits placed around a modulo $2^m$ identifier circle. Resources
are stored at existing successor nodes of the nodes they are mapped to. Each node stores a routing
table with $m$ entries such that the $i$-th entry stores the key of the first node succeeding it by at least
$2^{i-1}$ on the identifier circle. Each resource is also mapped onto the identifier circle and stored at the
first node succeeding the location that it maps to. Routing is done greedily to the farthest possible 
node in the routing table, and it is not hard to see that this gives an O(logn) delivery time with 
n nodes in the system. Tapestry uses Plaxton's algorithm JU], a form of suffix-based, hypercube 
routing, as the routing mechanism: in this algorithm, the message is forwarded deterministically to 
a node whose identifier is one digit closer to the target identifier. To this end, each node maintains 
O(logn) pieces of information and delivery time is also O(logn). 
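To make the Chord routing table described above concrete, here is a small sketch that builds, for a given node, the table whose $i$-th entry is the first node succeeding it by at least $2^{i-1}$ on the identifier circle. The function names and the three-node example are assumptions chosen for illustration; this is not Chord's actual code.

```python
import bisect


def successor(nodes, ident, m):
    """First node at or after ident on the circle modulo 2^m.

    nodes must be a sorted list of node identifiers in [0, 2^m).
    """
    i = bisect.bisect_left(nodes, ident % (2 ** m))
    return nodes[i % len(nodes)]  # wrap around past the largest identifier


def finger_table(nodes, node, m):
    """Entry i (1-based) is the first node succeeding `node` by at least 2^(i-1)."""
    return [successor(nodes, node + 2 ** (i - 1), m) for i in range(1, m + 1)]


# Hypothetical system with m = 3 bits and nodes 0, 1, and 3.
nodes = sorted([0, 1, 3])
table_0 = finger_table(nodes, 0, 3)
table_1 = finger_table(nodes, 1, 3)
```

With $m$ entries at doubling distances, each greedy hop can at least halve the remaining distance on the circle, which is where the $O(\log n)$ delivery time comes from.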

Although these systems seem vastly different, there is a recurrent underlying theme in the use 
of some variant of an overlay metric space in which the nodes are embedded. The location of 
a resource in this metric space is determined by its key. Each node maintains some information 
about its neighbors in the metric space, and routing is then simply done by forwarding packets 
to neighbors closer to the target node with respect to the metric. In CAN, the metric space is 
explicitly defined as the coordinate space which is covered by the zones and the distance metric 
used is simply the Euclidean distance. In Chord, the nodes can be thought of as being embedded on
grid points on a real circle, with distances measured along the circumference of the circle providing
the required distance metric. In Tapestry, we can think of the nodes as being embedded on a real
line, with the identifiers simply being the locations of the nodes on the real line. Euclidean distance
is used as the metric distance for greedy forwarding to nodes with identifiers closest to the target
node. This inherent common structure leads to similar results for the performance of such networks.
In this paper, we explain why most of these systems achieve similar performance guarantees by 
describing a general setting for such overlay metric spaces, although most of our results apply only 
in one-dimensional spaces. 

In general, the fault-tolerance properties of these systems are not well-defined. Each system 
provides a repair mechanism for failures but makes no performance guarantees till this mechanism 
kicks in. For large systems, where nodes appear and leave frequently, resilience to repeated and 
concurrent failures is a desirable and important property. Our experiments show that with our 
overlay space and linking strategies, the system performs reasonably well even with a large number 
of failures. 

4 Routing 

In this section, we present our lower and upper bounds on routing. We consider greedy routing
in a graph embedded in a line where each node is connected to its immediate neighbors and to
multiple long-distance neighbors chosen according to a fixed link distribution. We give lower bounds
for greedy routing for any link distribution satisfying certain properties (Theorem 10). We also
present upper bounds in the same model where the long-distance links are chosen as per the inverse
power-law distribution with exponent 1 and analyze the effects on performance in the presence of
failures.
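As an illustration of this model, the sketch below builds a graph on the points $0, \ldots, n-1$ in which each node keeps its immediate neighbors and draws long-distance neighbors with probability proportional to $1/|x - y|$, that is, the inverse power-law distribution with exponent 1. Sampling with replacement and the names `sample_long_links` and `build_graph` are simplifying assumptions; this is not the construction heuristic the paper develops later.

```python
import random


def sample_long_links(x, n, num_links, rng):
    """Draw long-distance neighbors of x from {0..n-1} minus {x}.

    Each candidate y is chosen with probability proportional to 1/|x - y|.
    Draws are with replacement, so duplicates may collapse (an assumption).
    """
    candidates = [y for y in range(n) if y != x]
    weights = [1.0 / abs(x - y) for y in candidates]
    return rng.choices(candidates, weights=weights, k=num_links)


def build_graph(n, num_links, seed=0):
    """Return {x: set of neighbors} with immediate and long-distance links."""
    rng = random.Random(seed)
    links = {}
    for x in range(n):
        neighbors = {y for y in (x - 1, x + 1) if 0 <= y < n}
        neighbors.update(sample_long_links(x, n, num_links, rng))
        links[x] = neighbors
    return links


graph = build_graph(n=64, num_links=3, seed=0)
```

Because the link distribution depends only on the distance $|x - y|$, every node can generate its own links without global coordination.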

4.1 Tools 

Some of our upper bounds will be proved using a well-known upper bound of Karp et a/.jH] on 
probabilistic recurrence relations. We will restate this bound as Lemma 1 and then show how a
similar technique can be used to get lower bounds with some additional conditions in Theorem 2.

Lemma 1 The time $T(X_0)$ needed for a nonincreasing real-valued Markov chain $X_0, X_1, X_2, \ldots$
to drop to 1 is bounded by

$$T(X_0) \le \int_1^{X_0} \frac{1}{\mu_z}\,dz, \tag{1}$$

when $\mu_z = \mathrm{E}[X_t - X_{t+1} : X_t = z]$ is a nondecreasing function of $z$.

This bound has a nice physical interpretation. If it takes one second to jump down $\mu_x$ meters
from $x$, then we are traveling at a rate of $\mu_x$ meters per second during that interval. When we zip
past some position $z$, we are traveling at the average speed $\mu_x$ determined by our starting point
$x \ge z$ for the interval. Since $\mu$ is nondecreasing, using $\mu_z$ as our estimated speed underestimates
our actual speed when passing $z$. The integral computes the time to get all the way to zero if we
use $\mu_z$ as our instantaneous speed when passing position $z$. Since our estimate of our speed is low
(on average), our estimate of our time will be high, giving an upper bound on the actual expected
time.
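A quick numerical illustration of Lemma 1, under an assumed toy chain that does not appear in the paper: let $X_{t+1} = U_t X_t$ with $U_t$ uniform on $[0, 1)$. Then $\mu_z = z/2$ is nondecreasing, and the integral bound for dropping from $x_0$ to 1 evaluates to $2 \ln x_0$. The sketch compares a simulated mean number of steps against that bound; the function names and parameters are hypothetical.

```python
import math
import random


def steps_to_drop(x0, rng):
    """Number of steps until the chain X_{t+1} = U * X_t falls to 1 or below."""
    x, steps = x0, 0
    while x > 1.0:
        x *= rng.random()  # nonincreasing: multiply by a uniform value in [0, 1)
        steps += 1
    return steps


def integral_bound(x0):
    """Integral of 1/mu_z from 1 to x0 with mu_z = z/2, i.e. 2 ln x0."""
    return 2.0 * math.log(x0)


rng = random.Random(1)
x0 = 1000.0
trials = 20_000
mean_steps = sum(steps_to_drop(x0, rng) for _ in range(trials)) / trials
bound = integral_bound(x0)
```

The simulated mean should fall below the bound, consistent with the observation that using $\mu_z$ as the instantaneous speed underestimates the true speed.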

We would like to get lower bounds on such processes in addition to upper bounds, and we
will not necessarily be able to guarantee that $\mu_z$, as defined in Lemma 1, will be a nondecreasing
function of $z$. But we will still use the same basic intuition: the average speed at which we pass $z$ is
at most the maximum average speed of any jump that takes us past $z$. We can find this maximum
speed by taking the maximum over all $x > z$; unfortunately, this may give us too large an estimate.
Instead, we choose a threshold $U$ for "short" jumps, compute the maximum speed of short jumps
of at most $U$ for all $x$ between $z$ and $z + U$, and handle the (hopefully rare) long jumps of more
than $U$ by conditioning against them. Subject to this conditioning, we can define an upper bound
$m_z$ on the average speed passing $z$, and use essentially the same integral as in (1) to get a lower
bound on the time. Some additional tinkering to account for the effect of the conditioning then
gives us our real lower bound, which appears in Theorem 2 below, as Inequality (8).

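Theorem 2 Let $X_0, X_1, X_2, \ldots$ be a Markov process with state space $S$, where $X_0$ is a constant. Let
$f$ be a non-negative real-valued function on $S$ such that, for all $t$,

$$\Pr[f(X_t) - f(X_{t+1}) \ge 0 : X_t] = 1. \tag{2}$$

Let $U$ and $\epsilon$ be constants such that for any $x > 0$,

$$\Pr[f(X_t) - f(X_{t+1}) \ge U : X_t = x] \le \epsilon. \tag{3}$$

Let

$$\tau = \min\{t : f(X_t) = 0\}. \tag{4}$$

For each $x$ with $f(x) > 0$, let $\mu_x > 0$ satisfy

$$\mu_x \ge \mathrm{E}[f(X_t) - f(X_{t+1}) : X_t = x,\ f(X_t) - f(X_{t+1}) < U]. \tag{5}$$

Now define

$$m_z = \sup\{\mu_x : x \in S,\ f(x) \in [z, z + U)\}, \tag{6}$$

and define

$$T(x) = \int_0^{f(x)} \frac{1}{m_z}\,dz. \tag{7}$$

Then

$$\mathrm{E}[\tau] \ge \frac{T(X_0)}{\epsilon T(X_0) + (1 - \epsilon)}. \tag{8}$$

Proof: Define

$$Y_t = \begin{cases} T(X_t) & \text{if } f(X_{t'}) - f(X_{t'+1}) < U \text{ for all } t' < t, \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$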

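The idea is that $Y_t$ drops to zero immediately if a long jump occurs. We will show that even with
such overeager jumping, $Y_t$ does not drop too quickly on average. The intuition is that the chance
of a long jump reduces $Y_t$ by at most an expected $\epsilon Y_t \le \epsilon Y_0$, while the effect of short jumps can be
bounded by applying the definition of $T$.

Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by $X_0, X_1, \ldots, X_t$. Let $A_t$ be the event that $f(X_t) - f(X_{t+1}) < U$,
that is, that the jump from $f(X_t)$ to $f(X_{t+1})$ is a short jump, and write $\overline{A_t}$ for its complement. Now compute

$$\begin{aligned}
\mathrm{E}[Y_t - Y_{t+1} : \mathcal{F}_t] &= \Pr[\overline{A_t} : \mathcal{F}_t]\,(Y_t - 0) + (1 - \Pr[\overline{A_t} : \mathcal{F}_t])\,\mathrm{E}[Y_t - Y_{t+1} : \mathcal{F}_t, A_t] \\
&\le \Pr[\overline{A_t} : \mathcal{F}_t]\,Y_0 + (\epsilon - \Pr[\overline{A_t} : \mathcal{F}_t])\,Y_0 + (1 - \epsilon)\,\mathrm{E}[Y_t - Y_{t+1} : \mathcal{F}_t, A_t] \\
&= \epsilon Y_0 + (1 - \epsilon)\,\mathrm{E}[Y_t - Y_{t+1} : \mathcal{F}_t, A_t].
\end{aligned} \tag{10}$$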
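Now let us bound $\mathrm{E}[Y_t - Y_{t+1} : \mathcal{F}_t, A_t]$. Expanding the definitions (7) and (9) gives

$$\mathrm{E}[Y_t - Y_{t+1} : \mathcal{F}_t, A_t] = \mathrm{E}\left[\int_{f(X_{t+1})}^{f(X_t)} \frac{1}{m_z}\,dz : \mathcal{F}_t, A_t\right]. \tag{11}$$

Now, conditioning on $A_t$ means that $f(X_{t+1}) > f(X_t) - U$ and thus $z > f(X_t) - U$ for the
entire range of the integral. It follows that $f(X_t)$ lies in the half-open interval $[z, z + U)$ for each
such $z$, from which we have $m_z \ge \mu_{X_t}$ from (6). Inverting gives $\frac{1}{m_z} \le \frac{1}{\mu_{X_t}}$, and plugging this
inequality into (11) gives

$$\begin{aligned}
\mathrm{E}[Y_t - Y_{t+1} : \mathcal{F}_t, A_t] &\le \mathrm{E}\left[\int_{f(X_{t+1})}^{f(X_t)} \frac{1}{\mu_{X_t}}\,dz : \mathcal{F}_t, A_t\right] \\
&= \frac{1}{\mu_{X_t}} \cdot \mathrm{E}[f(X_t) - f(X_{t+1}) : \mathcal{F}_t, A_t] \\
&\le 1.
\end{aligned} \tag{12}$$

Applying (12) to (10) gives

$$\mathrm{E}[Y_t - Y_{t+1} : \mathcal{F}_t] \le \epsilon Y_0 + (1 - \epsilon). \tag{13}$$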



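We have now shown that $Y_t$ drops slowly on average. To turn this into a lower bound on the
time at which it first reaches zero, define $Z_t = Y_t + \min(t, \tau)\,(\epsilon Y_0 + (1 - \epsilon))$. Conditioning on $t < \tau$,
observe that

$$\begin{aligned}
\mathrm{E}[Z_t - Z_{t+1} : \mathcal{F}_t, t < \tau] &= \mathrm{E}[Y_t - Y_{t+1} : \mathcal{F}_t, t < \tau] - (\epsilon Y_0 + (1 - \epsilon)) \\
&\le (\epsilon Y_0 + (1 - \epsilon)) - (\epsilon Y_0 + (1 - \epsilon)) \\
&= 0.
\end{aligned}$$

Alternatively, if $t \ge \tau$ we have

$$\mathrm{E}[Z_t - Z_{t+1} : \mathcal{F}_t, t \ge \tau] = 0.$$

In either case, $\mathrm{E}[Z_t - Z_{t+1} : \mathcal{F}_t] \le 0$, implying $Z_t \le \mathrm{E}[Z_{t+1} : \mathcal{F}_t]$. In other words, $\{Z_t, \mathcal{F}_t\}$ is a
submartingale.

Because $\{Z_t, \mathcal{F}_t\}$ is a submartingale, and $\tau$ is a stopping time relative to $\{\mathcal{F}_t\}$, we have $Z_0 =
Y_0 \le \mathrm{E}[Z_\tau] = \mathrm{E}[0 + \tau(\epsilon Y_0 + (1 - \epsilon))] = (\epsilon Y_0 + (1 - \epsilon))\,\mathrm{E}[\tau]$. Solving for $\mathrm{E}[\tau]$ then gives

$$\mathrm{E}[\tau] \ge \frac{Y_0}{\epsilon Y_0 + (1 - \epsilon)} = \frac{T(X_0)}{\epsilon T(X_0) + (1 - \epsilon)}.$$

∎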
4.2 Lower bounds on greedy routing

We will now show a lower bound on the expected time taken by greedy routing on a random graph
embedded in a line. Each node in the graph has expected outdegree at most $\ell$ and is connected to
its immediate neighbor on either side. For polylogarithmic values of $\ell$, we consider two variants of
the greedy routing algorithm and derive lower bounds for them equal to $\Omega(\log^2 n/(\ell \log\log n))$ and
to $\Omega(\log^2 n/(\ell^2 \log\log n))$, as stated in Theorem 10. The routing variants, along with the machinery
and proofs of the associated lower bounds, are presented in Sections 4.2.1 through 4.2.6. For large
values of $\ell$, a lower bound of $\Omega(\log n/\log \ell)$ on the worst-case routing time can be derived very simply, as
follows.

Theorem 3 Let $\ell \in (\lg n, n^c]$. Then for any link distribution and any routing strategy, the delivery
time $T = \Omega(1/c)$.

Proof: With $\ell$ links for each node, we can reach at most $\ell^k$ nodes at step $k$. Assuming that
the minimum time to reach all $n$ nodes is $T$, we need $\ell^T \ge n$. This gives a lower bound of $\Omega(\log n/\log \ell)$
on $T$, which is $\Omega(1/c)$ since $\ell \le n^c$. ∎
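The counting argument can be checked with a few lines of code: the sketch below computes the smallest $T$ for which $\ell^T \ge n$, which is the least number of steps after which all $n$ nodes could possibly have been reached. The function name `min_hops` and the sample values are illustrative assumptions.

```python
def min_hops(n, ell):
    """Smallest T with ell**T >= n, computed with exact integer arithmetic.

    Assumes ell >= 2; with ell links per node at most ell**k nodes are
    reachable at step k, so no strategy can cover n nodes in fewer steps.
    """
    hops, reach = 0, 1
    while reach < n:
        reach *= ell
        hops += 1
    return hops


hops_binary = min_hops(1024, 2)       # degree 2, 1024 nodes
hops_large = min_hops(10 ** 6, 100)   # degree 100 = n^(1/3), a million nodes
```

In the second example $\ell = n^{1/3}$, and the count of three steps matches $\log n/\log \ell = 3$.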

4.2.1 Lower bound for polylogarithmic number of links 

We consider the case of the expected outdegree of each node falling in the range $[1, \lg n]$. The
probability that a node at position $x$ is connected to positions $x - \Delta_1, x - \Delta_2, \ldots, x - \Delta_k$ depends
only on the set $\Delta = \{\Delta_1, \ldots, \Delta_k\}$ and not on $x$ and is independent of the choice of outgoing links
for other nodes.² Since we assume that each node is connected to its immediate neighbors, we
require that $\pm 1$ appears in $\Delta$.

We consider two variants of the greedy routing algorithm. Without loss of generality, we assume
that the target of the search is labeled 0. In one-sided greedy routing, the algorithm never traverses
a link that would take it past its target. So if the algorithm is currently at $x$ and is trying to reach
0, it will move to the node $x - \Delta_i$ with the smallest non-negative label. In two-sided greedy routing,
the algorithm chooses a link that minimizes the distance to the target, without regard to which
side of the target the other end of the link is. In the two-sided case the algorithm will move to a
node $x - \Delta_i$ whose label has the smallest absolute value, with ties broken arbitrarily. One-sided
greedy routing can be thought of as modeling algorithms on a graph with a boundary when the
target lies on the boundary, or algorithms where all links point in only one direction (as in Chord).
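The two variants can be written as successor functions on integer labels. In this sketch `delta` is a set of offsets containing $\pm 1$, the current node `x` is positive, and the target is 0; breaking two-sided ties toward the negative side is an arbitrary assumption, as are the function names and the sample offsets.

```python
def one_sided_successor(x, delta):
    """Move to the label x - d that is smallest among the non-negative ones.

    Links that would jump past the target 0 (negative labels) are never used.
    Assumes x >= 1 and 1 in delta, so at least x - 1 is a valid candidate.
    """
    return min(x - d for d in delta if x - d >= 0)


def two_sided_successor(x, delta):
    """Move to the label x - d of smallest absolute value.

    Ties between y and -y are broken toward the negative label; the model
    allows any tie-breaking rule, so this choice is only an assumption.
    """
    return min((x - d for d in delta), key=lambda y: (abs(y), y))


delta = {-8, -1, 1, 4, 16}  # hypothetical link offsets containing +1 and -1
next_one_sided = one_sided_successor(11, delta)
next_two_sided = two_sided_successor(11, delta)
```

From node 11 the one-sided rule takes the offset 4, while the two-sided rule prefers the offset 16 because it lands closer to the target, on the other side.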

Our results are stronger for the one-sided case than for the two-sided case. With one-sided
greedy routing, we show a lower bound of $\Omega(\log^2 n/(\ell \log\log n))$ on the time to reach 0 from a
point chosen uniformly from the range 1 to $n$ that applies to any link distribution. For two-sided
routing, we show a lower bound of $\Omega(\log^2 n/(\ell^2 \log\log n))$, with some constraints on the distribution.
We conjecture that these constraints are unnecessary, and that $\Omega(\log^2 n/(\ell \log\log n))$ is the correct
lower bound for both models. A formal statement of these results appears as Theorem 10 in
Section 4.2.6, but before we can prove it we must develop machinery that will be useful in the
proofs of both the one-sided and two-sided lower bounds.

4.2.2 Link sets: notation and distributions 

First we describe some notation for $\Delta$ sets. Write each $\Delta$ as

$$\{\Delta_{-s}, \ldots, \Delta_{-2}, \Delta_{-1} = -1, \Delta_1 = 1, \Delta_2, \ldots, \Delta_t\},$$

where $\Delta_i < \Delta_j$ whenever $i < j$. Each $\Delta$ is a random variable drawn from some distribution on
finite sets; the individual $\Delta_i$ are thus in general not independent. Let $\Delta^-$ consist of the $s$ negative
elements of $\Delta$ and $\Delta^+$ consist of the $t$ positive elements. Formally define $\Delta_{-i} = -\infty$ when $i > s$
and $\Delta_i = +\infty$ when $i > t$.

²We assume that nodes are labeled by integers and identify each node with its label to avoid excessive notation.

For one-sided routing, we make no assumptions about the distribution of $\Delta$ except that $|\Delta|$
must have finite expectation and $\Delta$ always contains 1. For two-sided routing, we assume that $\Delta$ is
generated by including each possible $\delta$ in $\Delta$ with probability $p_\delta$, where $p$ is symmetric about the
origin (i.e., $p_\delta = p_{-\delta}$ for all $\delta$), $p_1 = p_{-1} = 1$, and $p$ is unimodal, i.e. nonincreasing for positive
$\delta$ and nondecreasing for negative $\delta$.³ We also require that the events $[\delta \in \Delta]$ and $[\delta' \in \Delta]$ are
pairwise independent for distinct $\delta$, $\delta'$.

4.2.3 The aggregate chain $S^t$

For a fixed distribution on $\Delta$, the trajectory of a single initial point $X^0$ is a Markov chain
$X^0, X^1, X^2, \ldots$ with $X^{t+1} = s(X^t, \Delta^t)$, where $\Delta^t$ determines the outgoing links from the node
reached at time $t$ and $s$ is a successor function that selects the next node $X^{t+1} = X^t - \Delta^t_i$
according to the routing algorithm. Note that the chain is Markov, because the presence of $\pm 1$ links
guarantees that no node ever appears twice in the sequence, and so each new node corresponds to
a new choice of links.

From the $X^t$ chain we can derive an aggregate chain that describes the collective behavior of all
nodes in some range. Each state of the aggregate chain is a contiguous set of nodes whose labels
all have the same sign; we define the sign of the state to be the common sign of all of its elements.
For one-sided routing each state is either $\{0\}$ or an interval of the form $\{1 \ldots k\}$ for some $k$. For
two-sided routing the states are more general. The aggregate states are characterized formally in
Lemma 5.

Given a contiguous set of nodes $S$ and a set $\Delta$, define

$$S_{\Delta i} = \{x \in S : s(x, \Delta) = x - \Delta_i\}.$$

The intuition is that $S_{\Delta i}$ consists of all those nodes for which the algorithm will choose $\Delta_i$ as the
outgoing link. Note that $S_{\Delta i}$ will always be a contiguous range because of the greediness of the
algorithm. Now define, for each $\sigma \in \{-, 0, +\}$:

$$S_{\Delta i \sigma} = \{x \in S_{\Delta i} : \operatorname{sgn} s(x, \Delta) = \sigma\}.$$

Here we have simply split $S_{\Delta i}$ into those nodes with negative, zero, or positive successors.
For any set $A$ and integer $\delta$ write $A - \delta$ for $\{x - \delta : x \in A\}$.

We will now build our aggregate chain by letting the successors of a range $S$ be the ranges
$S_{\Delta i \sigma} - \Delta_i$ for all possible $\Delta$, $i$, and $\sigma$. As a special case, we define $S^{t+1} = \{0\}$ when $S^t = \{0\}$; once
we arrive at the target, we do not leave it. For all other $S^t$, we let

$$\Pr[S^{t+1} = S^t_{\Delta i \sigma} - \Delta_i : \Delta] = \frac{|S^t_{\Delta i \sigma}|}{|S^t|} \tag{14}$$

and define the unconditional transition probabilities by averaging over all $\Delta$.
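The construction can be traced on a small example. The sketch below splits a positive range into its subranges according to the chosen offset and the sign of the successor, then accumulates the size-proportional transition probabilities for one fixed link set. The tie-breaking rule, the function names, and the sample link set are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def chosen_offset(x, delta, two_sided):
    """Offset picked by greedy routing at node x > 0 when the target is 0."""
    if two_sided:
        # Minimize |x - d|; ties go to the negative side (an arbitrary choice).
        return min(delta, key=lambda d: (abs(x - d), x - d))
    # One-sided: only labels x - d >= 0 qualify, and the smallest one wins.
    return min((d for d in delta if x - d >= 0), key=lambda d: x - d)


def sign(y):
    return (y > 0) - (y < 0)


def split_range(S, delta, two_sided):
    """Return {(offset, sign of successor): sorted nodes of S in that subrange}."""
    parts = defaultdict(list)
    for x in sorted(S):
        d = chosen_offset(x, delta, two_sided)
        parts[(d, sign(x - d))].append(x)
    return dict(parts)


def transitions(S, delta, two_sided):
    """Successor ranges (subrange shifted by its offset) with their probabilities.

    Each subrange is chosen in proportion to its size; subranges that shift
    to the same successor range (for example {0}) have their weights summed.
    """
    total = len(S)
    probs = defaultdict(Fraction)
    for (d, _), xs in split_range(S, delta, two_sided).items():
        probs[tuple(x - d for x in xs)] += Fraction(len(xs), total)
    return dict(probs)


one_sided = transitions(range(1, 11), {-1, 1, 4}, two_sided=False)
two_sided = transitions(range(1, 11), {-1, 1, 4}, two_sided=True)
```

In the one-sided example every successor range is either the target alone or an interval starting at 1, which is the shape of aggregate state described for one-sided routing.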

Lemma 4 shows that moving to the aggregate chain does not misrepresent the underlying
single-point chain:

Lemma 4 Let $X^0$ be drawn uniformly from the range $S^0$. Let $Y^t$ be a uniformly chosen element
of $S^t$. Then for all $x$ and $t$, $\Pr[X^t = x] = \Pr[Y^t = x]$.

Proof: Clearly the lemma holds for $t = 0$. Fix $S^{t-1}$, and consider two methods for generating
$Y^t$. The first generates $Y^t$ directly from $Y^{t-1}$ and shows that $Y^t$ generated in this way has the
same distribution as $X^t$. The second generates $Y^t$ from $S^t$ as described in the lemma and produces
the same distribution on $Y^t$ as the first.

In the first method, we choose $Y^{t-1}$ uniformly from $S^{t-1}$, choose a random $\Delta^{t-1}$, and compute
$Y^t = s(Y^{t-1}, \Delta^{t-1})$. Here the transition rule applied to $Y^{t-1}$ is the same as for $X^{t-1}$, so under the
induction hypothesis that $Y^{t-1}$ and $X^{t-1}$ are equal in distribution, so are $Y^t$ and $X^t$.

In the second method, we again choose a random $\Delta^{t-1}$ and then choose $S^t$ by choosing some
$S^{t-1}_{\Delta i \sigma}$ in proportion to its size, let $S^t = S^{t-1}_{\Delta i \sigma} - \Delta_i$, and then let $Y^t$ be a uniformly chosen element of
$S^t$. We can implement the choice of $S^{t-1}_{\Delta i \sigma}$ by choosing some $Y^{t-1}$ uniformly from $S^{t-1}$ and picking
$S^{t-1}_{\Delta i \sigma}$ as the subrange that contains $Y^{t-1}$; and we can simplify the task of choosing $Y^t$ by setting
it equal to $Y^{t-1} - \Delta_i$, since conditioning on $Y^{t-1} \in S^{t-1}_{\Delta i \sigma}$ leaves $Y^{t-1}$ with a uniform distribution.
But by implementing the second method in this way, we have reduced it to the first, and the lemma
is proved. ∎

³These constraints imply that $p_0 = 1$; formally, we imagine that 0 is present in each $\Delta$ but is ignored by the
routing algorithm.

Lemma 5 justifies our earlier characterization of the aggregate state spaces:

Lemma 5 Let $S^0 = \{1 \ldots n\}$ for some $n$. Then with one-sided routing, every $S^t$ is either $\{0\}$ or
of the form $\{1 \ldots k\}$ for some $k$; and with two-sided routing, every $S^t$ is an interval of integers in
which every element has the same sign.

Proof: By induction on $t$. For one-sided routing, observe that $S^{t-1}_{\Delta i -}$ is always empty, as
the routing algorithm is not allowed to jump to negative nodes. If $S^t = S^{t-1}_{\Delta i 0} - \Delta_i$, then $S^t =
\{\Delta_i\} - \Delta_i = \{0\}$. Otherwise $S^t = S^{t-1}_{\Delta i +} - \Delta_i$; but since $S^{t-1} = \{1 \ldots k\}$ for some $k$, if it contains
any point $x$ greater than $\Delta_i$ it must contain $\Delta_i + 1$; thus $\min(S^{t-1}_{\Delta i +}) = \Delta_i + 1$ and so $\min(S^t)$
becomes 1.

The result for the two-sided case is immediate from the fact that $S^t = S^{t-1}_{\Delta i \sigma} - \Delta_i$ combined with
the definition of $S^{t-1}_{\Delta i \sigma}$. ∎

The advantage of the aggregate chain over the single-point chain is that, while we cannot do
much to bound the progress of a single point with an arbitrary distribution on $\Delta$, we can show
that the size of $S^t$ does not drop too quickly given a bound $\ell$ on $\mathrm{E}[|\Delta|]$. The intuition is that each
successor set of size $a^{-1}|S^t|$ or less occurs with probability at most $a^{-1}$, and there are at most $3\ell$
such sets on average.

Lemma 6 Let $\mathrm{E}[|\Delta|] \le \ell$. Then for any $a > 1$, in either the one-sided or two-sided model,

$$\Pr\left[|S^{t+1}| \le a^{-1}|S^t| : S^t\right] \le 3\ell a^{-1}. \tag{15}$$

Proof: Fix $S^t$. First note that if $a^{-1}|S^t| < 1$, then $\Pr[|S^{t+1}| \le a^{-1}|S^t| : S^t] = 0$. So we can assume
that $a^{-1}|S^t| \ge 1$ and in particular that $a \le |S^t|$.

Conditioning on $\Delta$, there are at most $3|\Delta|$ non-empty sets $S^t_{\Delta i \sigma}$. If $|S^t_{\Delta i \sigma}| \le a^{-1}|S^t|$, then $S^t_{\Delta i \sigma}$
is chosen with probability at most $a^{-1}$ by (14). Thus the probability of choosing any of the at most
$3|\Delta|$ sets $S^t_{\Delta i \sigma}$ of size at most $a^{-1}|S^t|$ is at most $3|\Delta|a^{-1}$.

Now observe that

$$\Pr\left[|S^{t+1}| \le a^{-1}|S^t| : S^t\right] \le \sum_d \Pr[|\Delta| = d]\,3da^{-1} = 3a^{-1}\,\mathrm{E}[|\Delta|] \le 3\ell a^{-1}.$$

∎

Another way to write (15) is to say that $\Pr[\ln|S^t| - \ln|S^{t+1}| \ge \ln a : S^t] \le 3\ell a^{-1}$, which will
give the bound (3) on the probability of large jumps when it comes time to apply Theorem 2.

4.2.4 Boundary points 

Lemma 6 says that $|S^t|$ seldom drops by too large a ratio at once, but it doesn't tell us much about
how quickly $|S^t|$ drops in short hops. To bound this latter quantity, we need to get a bound on
how many subranges $S^t$ splinters into through the action of $s(\cdot, \Delta)$. We will do so by showing that
only certain points can appear as the boundaries of these subranges in the direction of 0.
For fixed $\Delta$, define for each $i > 0$

$$\beta_i = \left\lceil \frac{\Delta_i + \Delta_{i+1}}{2} \right\rceil \quad\text{and}\quad \beta_{-i} = \left\lfloor \frac{\Delta_{-i} + \Delta_{-i-1}}{2} \right\rfloor.$$

Let $\beta$ be the set of all finite $\beta_i$ and $\beta_{-i}$.

Lemma 7 Fix $S$ and $\Delta$ and let $\beta$ be defined as above. Suppose that $S$ is positive. Let $M =
\{\min(S_{\Delta i \sigma}) : S_{\Delta i \sigma} \ne \emptyset\}$ be the set of minimum elements of subranges $S_{\Delta i \sigma}$ of $S$. Then $M$ is a
subset of $S$ and contains no elements other than

1. $\min(S)$,

2. $\Delta_i$ for each $i > 0$,

3. $\Delta_i + 1$ for each $i > 0$, and

4. at most one of $\beta_i$ or $\beta_i + 1$ for each $i > 0$,

where the last case holds only with two-sided routing.

If $S$ is negative, the symmetric condition holds for $M = \{\max(S_{\Delta i \sigma}) : S_{\Delta i \sigma} \ne \emptyset\}$.

Proof: Consider some subrange $S_{\Delta i \sigma}$ of $S$. If $S_{\Delta i \sigma}$ contains $\min(S)$, the first case holds.
Otherwise: (a) if $S_{\Delta i \sigma} = S_{\Delta i 0}$, the second case holds; (b) if $S_{\Delta i \sigma} = S_{\Delta i +}$, the third case holds;
(c) if $S_{\Delta i \sigma} = S_{\Delta i -}$, the fourth case holds, with $\min(S_{\Delta i -}) = \beta_{i-1}$ if $\Delta_{i-1} + \Delta_i$ is odd, and either
$\beta_{i-1}$ or $\beta_{i-1} + 1$ if $\Delta_{i-1} + \Delta_i$ is even, depending on whether the tie-breaking rule assigns $\beta_{i-1}$ to
$S_{\Delta (i-1) +}$ or to $S_{\Delta i -}$. ∎

We will call the elements of $M$ boundary points of $S$.
4.2.5 Bounding changes in $\ln |S^t|$

Now we would like to use Lemmas 6 and 7 to get an upper bound on the rate at which
$\ln |S^t|$ drops as a function of the $\Delta$ distribution.

The following lemma is used to bound a sum that arises in Lemma 9.

Lemma 8 Let c > 0. Let '^^^^ Xi = M where each x,, > and at least one Xi is greater than c Let 
B he the set of all i for which Xi is greater than c. Then 



Proof: If $M/n \le c$, we still have $x_i > c$ for all $i \in B$, so the left-hand side cannot be less than $\ln c$. So the interesting case is when $M/n > c$.

Let $B$ have $b$ elements. Then $\sum_{i \notin B} x_i \le (n - b)c$ and $\sum_{i \in B} x_i \ge M - (n - b)c = M - nc + bc$. Because $x_i \ln x_i$ is convex, its sum over $B$ is minimized for fixed $\sum_{i \in B} x_i$ by setting all such $x_i$ equal, in which case the left-hand side of the inequality becomes simply $\ln(x_i)$ for any $i \in B$.

Now observe that setting all $x_i$ in $B$ equal gives $x_i \ge \frac{M - nc + bc}{b} = \frac{M - nc}{b} + c \ge \frac{M - nc}{n} + c = \frac{M}{n}$. ∎
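
As a quick numerical sanity check of Lemma 8, the following Python sketch evaluates both sides of the inequality, assuming it reads $\sum_{i \in B} x_i \ln x_i / \sum_{i \in B} x_i \ge \ln\max(c, M/n)$. The function name `lemma8_sides` and the sample values are hypothetical and chosen only for illustration.

```python
import math

def lemma8_sides(xs, c):
    """Return (lhs, rhs) of the assumed Lemma 8 inequality.

    xs: non-negative numbers with at least one entry greater than c.
    lhs: size-weighted average of ln x over the entries greater than c.
    rhs: ln max(c, M/n), where M = sum(xs) and n = len(xs).
    """
    big = [x for x in xs if x > c]           # the set B of the lemma
    lhs = sum(x * math.log(x) for x in big) / sum(big)
    rhs = math.log(max(c, sum(xs) / len(xs)))
    return lhs, rhs

lhs, rhs = lemma8_sides([1.0, 2.0, 3.0, 10.0], c=2.5)
print(lhs >= rhs)
```

A check like this does not replace the proof; it only guards against transcription slips in the statement.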



Lemma 9 Fix $a > 1$, and let $S = S^t$ be a positive range with $|S| > a$. Define $\beta$ as in Lemma 7. Let $S' = [\min(S) + \lceil a^{-1}|S| \rceil - 1, \max(S) - 1]$. Let $A$ be the event $[\ln|S^t| - \ln|S^{t+1}| \le \ln a]$. Then

$$\mathrm{E}\left[\ln|S^t| - \ln|S^{t+1}| : S^t, A\right] \le \ln\frac{1}{1 - a^{-1}} + \frac{\ln \mathrm{E}[1 + Z : S^t]}{\Pr[A : S^t]}, \qquad (17)$$

where $Z = 2|\Delta \cap S'|$ with one-sided routing and $Z = 2|\Delta \cap S'| + |\beta \cap S'|$ with two-sided routing.

Proof: Call a subrange $S_{\Delta i\sigma}$ large if $|S_{\Delta i\sigma}| \ge a^{-1}|S|$ and small otherwise; the intent is that the large ranges are precisely those that yield $\ln|S^t| - \ln|S^{t+1}| \le \ln a$. Observe that for any large $S_{\Delta i\sigma}$, $|S_{\Delta i\sigma}| \ge a^{-1}|S| > 1$, implying any large set has at least two elements.

For any large $S_{\Delta i\sigma}$, $\max(S_{\Delta i\sigma}) \ge \min(S) + \lceil a^{-1}|S| \rceil - 1$. Similarly $\min(S_{\Delta i\sigma}) \le \max(S) - 1$. So any large $S_{\Delta i\sigma}$ intersects $S'$ in at least one point.

Let $T = \{T_1, T_2, \ldots, T_k\}$ be the set of subranges $S_{\Delta i\sigma}$, large or small, that intersect $S'$. It is immediate from this definition that $\bigcup T \supseteq S'$ and thus $\sum_j |T_j| \ge |S'|$.

Using Lemma 7 we can characterize the elements of $T$ as follows.

1. There is at most one set $T_j$ that contains $\min(S')$.

2. There is at most one set $T_j$ that has $\min(T_j) = \Delta_i$ for each $\Delta_i$ in $S'$.

3. There is at most one set $T_j$ that has $\min(T_j) = \Delta_i + 1$ for each $\Delta_i$ in $S'$.

4. With two-sided routing, there is at most one set $T_j$ that has $\min(T_j) = \beta_i$ or $\min(T_j) = \beta_i + 1$ for each $\beta_i$ in $S'$. Note that there may be a set whose minimum element is $\beta_i + 1$ where $\beta_i = \min(S') - 1$, but this set is already accounted for by the first case.

Thus $T$ has at most $1 + Z = 1 + 2|\Delta \cap S'|$ elements with one-sided routing and at most $1 + Z = 1 + 2|\Delta \cap S'| + |\beta \cap S'|$ elements with two-sided routing.

Conditioning on $|S^{t+1}| \ge a^{-1}|S|$, $|S^{t+1}|$ is equal to $|S_{\Delta i\sigma}|$ for some large $S_{\Delta i\sigma}$ and thus for some large $T_j \in T$. Which large $T_j$ is chosen is proportional to its size, so for fixed $T$, we have

$$\mathrm{E}\left[\ln|S^{t+1}| : T, A\right] = \frac{\sum_{\text{large } T_j} |T_j| \ln|T_j|}{\sum_{\text{large } T_j} |T_j|} \ge \ln\left(\max\left(a^{-1}|S|, \frac{\sum_j |T_j|}{|T|}\right)\right) \ge \ln\frac{|S'|}{|T|},$$

where the first inequality follows from Lemma 8.
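Now let us compute

$$\begin{aligned}
\mathrm{E}\left[\ln|S^t| - \ln|S^{t+1}| : S^t, A\right] &= \ln|S^t| - \mathrm{E}\left[\ln|S^{t+1}| : S^t, A\right] \\
&\le \ln|S^t| - \mathrm{E}\left[\ln|S'| - \ln|T| : S^t, A\right] \\
&= \ln\frac{|S^t|}{|S'|} + \mathrm{E}\left[\ln|T| : S^t, A\right] \\
&\le \ln\frac{|S^t|}{|S'|} + \frac{\mathrm{E}[\ln|T| : S^t]}{\Pr[A : S^t]} \\
&\le \ln\frac{1}{1 - a^{-1}} + \frac{\ln \mathrm{E}[|T| : S^t]}{\Pr[A : S^t]}.
\end{aligned}$$

In the second-to-last step, we use $\mathrm{E}[\ln|T| : S^t, A] \le \mathrm{E}[\ln|T| : S^t]/\Pr[A : S^t]$, which follows from $\mathrm{E}[\ln|T| : S^t] = \mathrm{E}[\ln|T| : S^t, A]\Pr[A : S^t] + \mathrm{E}[\ln|T| : S^t, \neg A]\Pr[\neg A : S^t]$. In the last step, we use $\mathrm{E}[\ln|T| : S^t] \le \ln \mathrm{E}[|T| : S^t]$, which follows from the concavity of $\ln$ and Jensen's inequality. Since $|T| \le 1 + Z$, this gives (17). ∎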



4.2.6 Putting the pieces together 

We now have all the tools we need to prove our lower bound. 

Theorem 10 Let $G$ be a random graph whose nodes are labeled by the integers. Let $\Delta_x$ for each $x$ be a set of integer offsets chosen independently from some common distribution, subject to the constraint that $-1$ and $+1$ are present in every $\Delta_x$, and let node $x$ have an outgoing link to $x - \delta$ for each $\delta \in \Delta_x$. Let $\ell = \mathrm{E}[|\Delta|]$. Consider a greedy routing trajectory in $G$ starting at a point chosen uniformly from $1 \ldots n$ and ending at 0.

With one-sided routing, the expected time to reach 0 is

$$\Omega\left(\frac{\log^2 n}{\ell \log\log n}\right). \qquad (18)$$

With two-sided routing, the expected time to reach 0 is

$$\Omega\left(\frac{\log^2 n}{\ell^2 \log\log n}\right), \qquad (19)$$

provided $\Delta$ is generated by including each $\delta$ in $\Delta$ with probability $p_\delta$, where (a) $p$ is unimodal, (b) $p$ is symmetric about 0, and (c) the choices to include particular $\delta, \delta'$ are pairwise independent.

Proof: Let $S^0 = \{1 \ldots n\}$.

We are going to apply Theorem 2 to the sequence $S^0, S^1, S^2, \ldots$ with $f(S) = \ln|S|$. We have chosen $f$ so that when we reach the target, $f(S) = 0$, so that a lower bound on $\tau$ gives a lower bound on the expected time of the routing algorithm. To apply the theorem, we need to show that (a) the probability that $\ln|S|$ drops by a large amount is small, and (b) the integral that defines $T$ in Theorem 2 is large.

Let $a = 3\ell \ln^3 n$. By Lemma 6, for all $t$, $\Pr[|S^{t+1}| \le a^{-1}|S^t| : S^t] \le 3\ell a^{-1} = \ln^{-3} n$, and thus $\Pr[\ln|S^t| - \ln|S^{t+1}| \ge \ln a : S^t] \le \ln^{-3} n$. This satisfies the large-drop condition of Theorem 2 with $U = \ln a$ and $\epsilon = \ln^{-3} n$.

For the second step, Theorem 2 requires that we bound the speed of the change in $f(S)$ solely as a function of $f(S)$. For one-sided routing this is not a problem, as Lemma 5 shows that $f(S)$, which reveals $|S|$, characterizes $S$ exactly except when $|S| = 1$ and the lower bound argument is done. For two-sided routing, the situation is more complicated; there may be some $S^t$ which is not of the form $\{1 \ldots |S^t|\}$ or $\{0\}$, and we need a bound on the speed at which $\ln|S^t|$ drops that applies equally to all sets of the same size.

It is for this purpose (and only for this purpose) that we use our conditions on $\Delta$ for two-sided routing. Suppose that each $\delta$ appears in $\Delta$ with probability $p_\delta$, that these probabilities are pairwise-independent, and that the sequence $p$ is symmetric and unimodal. Let $\hat{\beta} = \{\mathrm{absceil}\left(\frac{x + y}{2}\right) : x, y \in \Delta, x \ne y\}$, where $\mathrm{absceil}(z)$, the absolute ceiling of $z$, is $\lceil z \rceil$ when $z \ge 0$ and $\lfloor z \rfloor$ when $z < 0$. Observe that $\hat{\beta} \supseteq \beta$; in effect, we are counting in $\hat{\beta}$ all midpoints of pairs of distinct elements of $\Delta$ without regard to whether the elements are adjacent. For each $k$, the expected number of distinct pairs $x, y$ with $x + y = k$ and $x, y \in \Delta$ is at most $b_k = \sum_{i=-\infty}^{\infty} p_{k-i} p_i$; this is a convolution of the non-negative, symmetric, and unimodal $p$ sequence with itself and so it is also symmetric and unimodal. It follows that for all $0 \le k \le k'$, $b_k \ge b_{k'}$, and similarly $b_{-k} \ge b_{-k'}$.

Now for the punch line: for each $\delta \ne 0$, $q_\delta = b_{2\delta - \mathrm{sgn}\,\delta} + b_{2\delta}$ is an upper bound on the expected number of distinct pairs $x, y$ that put $\delta$ in $\hat{\beta}$, which is in turn an upper bound on $\Pr[\delta \in \hat{\beta}]$, and from the unimodality of $b$ we have that $q_\delta \ge q_{\delta'}$ and $q_\delta \ge q_{-\delta'}$ whenever $0 < \delta < \delta'$. Though $q$ grossly overcounts the elements of $\hat{\beta}$ (in particular, it gives a bound on $\mathrm{E}[|\hat{\beta}|]$ of $\ell^2$), its ordering property means that we can bound the expected number of elements of $\hat{\beta}$ that appear in some subrange of any positive $S^t$ by using $q$ to bound the expected number of elements that appear in the corresponding subrange of $\{1 \ldots |S^t|\}$, and similarly for negative $S^t$ and $\{-1 \ldots -|S^t|\}$. Because $p_i$ already satisfies a similar ordering property, we can thus bound the number of elements of both $\Delta$ and $\hat{\beta}$ that hit a fixed subrange of $S^t$ given only $|S^t|$. We do this next.
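
The ordering property of $b$ and $q$ can be checked numerically on a concrete sequence. The Python sketch below is illustrative only: the helper names `self_convolution` and `q_bound`, the truncation of offsets to $[-8, 8]$, and the convention $p_0 = 1$ (used so that the sequence is unimodal about 0) are assumptions made for this example. The convolution counts ordered pairs, including $x = y$, so it only overestimates the number of distinct pairs.

```python
def self_convolution(p):
    """Given p[offset] = inclusion probability, return b[k] = sum_i p[k - i] * p[i]."""
    b = {}
    for x, px in p.items():
        for y, py in p.items():
            b[x + y] = b.get(x + y, 0.0) + px * py
    return b

def q_bound(b, delta):
    """q_delta = b[2*delta - sgn(delta)] + b[2*delta], for delta != 0."""
    sgn = 1 if delta > 0 else -1
    return b.get(2 * delta - sgn, 0.0) + b.get(2 * delta, 0.0)

# Symmetric, unimodal inclusion probabilities: p_d = 1/|d|, with p_0 set to 1.
p = {d: (1.0 if d == 0 else 1.0 / abs(d)) for d in range(-8, 9)}
b = self_convolution(p)
qs = [q_bound(b, d) for d in range(1, 9)]
print(all(qs[i] >= qs[i + 1] for i in range(len(qs) - 1)))
```

For an inverse power-law sequence such as this one, the values $q_1, q_2, \ldots$ decrease, which is the property the argument relies on.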

For convenience, formally define $p_i = \Pr[i \in \Delta]$ and $q_i = 0$ for one-sided routing. We will simplify some of the summations by first summing the $p_i$ and $q_i$ over certain pre-defined intervals. For each integer $i \ge 0$ let $A_i = \{k \in \mathbb{Z} : a^i - 1 \le k < a^{i+1} - 1\} = \{k \in \mathbb{Z} : \lfloor \log_a (k + 1) \rfloor = i\}$. Let $\gamma_i = \sum_{k \in A_i} (2p_k + q_k)$. Note that $\gamma_i \ge 2\mathrm{E}[|A_i \cap \Delta|]$ for one-sided routing and $\gamma_i \ge 2\mathrm{E}[|A_i \cap \Delta|] + \mathrm{E}[|A_i \cap \beta|]$ for two-sided routing. Observe also that $\sum_{i=0}^{\infty} \gamma_i$ is at most $2\ell$ for one-sided routing and at most $2\ell + \ell^2$ for two-sided routing.

Consider some $S = S^t$. Let $A$ be the event $[\ln|S^t| - \ln|S^{t+1}| \le \ln a]$. If $|S| > a$, then by Lemma 9 we have

$$\mathrm{E}\left[\ln|S^t| - \ln|S^{t+1}| : S^t, A\right] \le \ln\frac{1}{1 - a^{-1}} + \frac{\ln \mathrm{E}[1 + Z : S^t]}{\Pr[A : S^t]}, \qquad (20)$$

where $Z = 2|\Delta \cap S'|$ with one-sided routing and $Z = 2|\Delta \cap S'| + |\beta \cap S'|$ with two-sided routing, with $S' = [\min(S) + \lceil a^{-1}|S| \rceil - 1, \max(S) - 1]$ in each case, as in Lemma 9.

As we observed earlier, our choice of $a$ and Lemma 6 imply $\Pr[\ln|S^t| - \ln|S^{t+1}| > \ln a : S^t] \le \ln^{-3} n$, so $\Pr[A : S^t] = 1 - \Pr[\ln|S^t| - \ln|S^{t+1}| > \ln a : S^t] \ge 1 - \ln^{-3} n \ge \frac{1}{2}$ for sufficiently large $n$. So we can replace (20) with

$$\mathrm{E}\left[\ln|S^t| - \ln|S^{t+1}| : S^t, A\right] \le \ln\frac{1}{1 - a^{-1}} + 2\ln \mathrm{E}[1 + Z : S^t]. \qquad (21)$$


Let us now obtain a bound on $\ln \mathrm{E}[1 + Z]$ in terms of $|S|$ and the $p_i$ and $q_i$. For one-sided routing, we use the fact that $|S| > 1$ implies $S = \{1 \ldots |S|\}$. For two-sided routing, we use monotonicity of the $p_i$ and $q_i$ to replace $S$ with $\{1 \ldots |S|\}$; in particular, to replace a sum of $2p_i + q_i$ over a subrange of $S$ with a sum over a subrange of $\{1 \ldots |S|\}$ that is at least as large. In either case, we get that

$$\ln \mathrm{E}[1 + Z] \le \ln\left(1 + \sum_{i=\lceil a^{-1}|S| \rceil - 1}^{|S| - 1} (2p_i + q_i)\right), \qquad (22)$$

and thus $\mathrm{E}[\ln|S^t| - \ln|S^{t+1}| : S^t, A]$ is bounded by

$$\mu_{\ln|S|} = \ln\frac{1}{1 - a^{-1}} + 2\ln\left(1 + \sum_{i=\lceil a^{-1}|S| \rceil - 1}^{|S| - 1} (2p_i + q_i)\right), \qquad (23)$$

provided $|S| > a$. For $|S| \le a$, set $\mu_{\ln|S|} = \ln a$.

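Let us now compute $m_z$, as defined for Theorem 2. For $z \le \ln a$, $m_z = \ln a$. For larger $z$, observe that $m_z = \sup\{\mu_{\ln|S|} : e^z \le |S| \le ae^z\}$. Now if $e^z \le |S| \le ae^z$, then the bounds on the sum in (23) both lie between $\lceil a^{-1}e^z \rceil - 1$ and $ae^z - 1$, so that

$$\begin{aligned}
m_z &\le \ln\frac{1}{1 - a^{-1}} + 2\ln\left(1 + \sum_{i=\lceil a^{-1}e^z \rceil - 1}^{\lfloor ae^z - 1 \rfloor} (2p_i + q_i)\right) \\
&\le \ln\frac{1}{1 - a^{-1}} + 2\ln(1 + \gamma_{z'} + \gamma_{z'+1} + \gamma_{z'+2}),
\end{aligned}$$

where $z' = \lfloor z/\ln a \rfloor - 1$.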
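Finally, compute

$$\begin{aligned}
T(\ln n) &= \int_0^{\ln n} \frac{1}{m_z}\,dz \\
&\ge \int_{\ln a}^{\ln n} \frac{1}{\ln\frac{1}{1 - a^{-1}} + 2\ln(1 + \gamma_{z'} + \gamma_{z'+1} + \gamma_{z'+2})}\,dz \\
&\ge \sum_{i=0}^{\lfloor \ln n/\ln a \rfloor - 1} \frac{\ln a}{\ln\frac{1}{1 - a^{-1}} + 2\ln(1 + \gamma_i + \gamma_{i+1} + \gamma_{i+2})}.
\end{aligned}$$
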

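To get a lower bound on the sum, note that

$$\sum_{i=0}^{\lfloor \ln n/\ln a \rfloor - 1} (\gamma_i + \gamma_{i+1} + \gamma_{i+2}) \le 3\sum_{i=0}^{\lfloor \ln n/\ln a \rfloor + 1} \gamma_i \le 3\sum_{i=0}^{\infty} \gamma_i,$$

which is at most $L = 6\ell$ for one-sided routing and at most $L = 6\ell + 3\ell^2$ for two-sided routing. In either case, because $\frac{1}{c + 2\ln(1 + x)}$ is convex and decreasing, we have

$$\begin{aligned}
T(\ln n) &\ge \sum_{i=0}^{\lfloor \ln n/\ln a \rfloor - 1} \frac{\ln a}{\ln\frac{1}{1 - a^{-1}} + 2\ln(1 + \gamma_i + \gamma_{i+1} + \gamma_{i+2})} \\
&\ge \sum_{i=0}^{\lfloor \ln n/\ln a \rfloor - 1} \frac{\ln a}{\ln\frac{1}{1 - a^{-1}} + 2\ln\left(1 + \frac{L}{\lfloor \ln n/\ln a \rfloor}\right)} \\
&= \frac{\ln a \,\lfloor \ln n/\ln a \rfloor}{\ln\frac{1}{1 - a^{-1}} + 2\ln\left(1 + \frac{L}{\lfloor \ln n/\ln a \rfloor}\right)}. \qquad (24)
\end{aligned}$$
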



We will now rewrite our bound on $T(\ln n)$ in a more convenient asymptotic form. We will ignore the 1 and concentrate on the large fraction. Recall that $a = 3\ell \ln^3 n$, so $\ln a = \Theta(\ln \ell + \ln\ln n)$. Unless $\ell$ is polynomial in $n$, we have $\ln n/\ln a = \omega(1)$ and the numerator simplifies to $\Theta(\ln n)$.

Now let us look at the denominator. Consider first the term $\ln\frac{1}{1 - a^{-1}}$. We can rewrite this term as $-\ln(1 - a^{-1})$; since $a^{-1}$ goes to zero as $\ell$ and $n$ grow we have $-\ln(1 - a^{-1}) = \Theta(a^{-1}) = \Theta(\ell^{-1} \ln^{-3} n)$. It is unlikely that this term will contribute much.

Turning to the second term, let us use the fact that $\ln(1 + x) \le x$ for $x \ge 0$. Thus

$$2\ln\left(1 + \frac{L}{\lfloor \ln n/\ln a \rfloor}\right) \le \frac{2L}{\lfloor \ln n/\ln a \rfloor} = O\left(\frac{L(\log \ell + \log\log n)}{\log n}\right)$$

and the bound in (24) simplifies to $\Omega(\log^2 n/(L(\log \ell + \log\log n)))$. We can further assume that $\ell = O(\log^2 n)$, since otherwise the bound degenerates to $\Omega(1)$, and rewrite it simply as $\Omega(\log^2 n/(L \log\log n))$.

For large $L$, the approximation $\ln(1 + x) \le 1 + \ln x$ for $x \ge 0.59$ is more useful. In this case (24) simplifies to $T(\ln n) = \Omega(\ln n/\ln \ell)$, which has a natural interpretation in terms of the tree of successor nodes of some single starting node and gives essentially the same bound as Theorem 3.

We are not quite done with Theorem 2 yet, as we still need to plug our $T$ and $\epsilon$ into the bound of that theorem to get a lower bound on $\mathrm{E}[\tau]$. But here we can simply observe that $\epsilon T = O(1/\log n)$, so the denominator in that bound goes rapidly to 1. Our stated bounds are thus finally obtained by substituting $O(\ell)$ or $O(\ell^2)$ for $L$. ∎



4.2.7 Possible strengthening of the lower bound 

Examining the proof of Theorem 10, both the $\ell^2$ that appears in the bound (19) for two-sided routing and the extra conditions imposed on the $\Delta$ distribution arise only as artifacts of our need to project each range $S$ onto $\{1 \ldots |S|\}$ and thus reduce the problem to tracking a single parameter. We believe that a more sophisticated argument that does not collapse ranges together would show a stronger result:

Conjecture 11 Let $G$, $\Delta$, and $\ell$ be as in Theorem 10. Consider a greedy routing trajectory starting at a point chosen uniformly from $1 \ldots n$ and ending at 0. Then the expected time to reach 0 is

$$\Omega\left(\frac{\log^2 n}{\ell \log\log n}\right)$$

with either one-sided or two-sided routing, and no constraints on the $\Delta$ distribution.

We also believe that the bound continues to hold in higher dimensions than 1. Unfortunately, the fact that we can embed the line in, say, a two-dimensional grid is not enough to justify this belief; divergence to one side or the other of the line may change the distribution of boundaries between segments and break the proof of Theorem 10.

4.3 Upper Bounds 

In this section, we present upper bounds on the delivery time of messages in a simple metric space: 
a one-dimensional real line. To simplify theoretical analysis, the system is set up as follows. 

• Nodes are embedded at grid points on the real line. 

• Each node u is connected to its nearest neighbor on either side and to one or more long- 
distance neighbors. 

• The long-distance neighbors are chosen as per the inverse power-law distribution with exponent 1, i.e., each long-distance neighbor $v$ is chosen with probability inversely proportional to the distance between $u$ and $v$. Formally, $\Pr[v \text{ is the } i\text{th neighbor of } u] = \frac{1/d(u,v)}{\sum_{v' \ne u} 1/d(u,v')}$, where $d(u, v)$ is the distance between nodes $u$ and $v$ in the metric space.

• Routing is done greedily by forwarding the message to the neighbor closest to the target node. 

We analyze the performance for the cases of a single long-distance link and of multiple ones, 
both in a failure-free network and in a network with link and node failures. Note that when we say 
node, we actually refer to a vertex in the virtual overlay network and not a physical node as in the 
earlier sections. 
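
The model described above can be exercised with a small simulation. The Python sketch below is illustrative only and is not code from this work: the names `sample_links` and `greedy_route`, the use of positions `0..n-1` on a finite line, and the two-sided greedy rule (forward to whichever neighbor is closest to the target, even if it overshoots) are assumptions made for the example.

```python
import random

def sample_links(n, num_links, rng):
    """For each node u in 0..n-1, draw num_links long-distance neighbors,
    with replacement, where Pr[v] is proportional to 1 / d(u, v)."""
    links = []
    for u in range(n):
        others = [v for v in range(n) if v != u]
        weights = [1.0 / abs(u - v) for v in others]
        links.append(rng.choices(others, weights=weights, k=num_links))
    return links

def greedy_route(src, dst, links, n):
    """Greedy routing: forward to the known neighbor closest to dst.
    Returns the number of hops taken."""
    hops, cur = 0, src
    while cur != dst:
        # Immediate neighbors on either side, plus the long-distance links.
        candidates = [c for c in (cur - 1, cur + 1) if 0 <= c < n] + links[cur]
        cur = min(candidates, key=lambda v: abs(v - dst))
        hops += 1
    return hops

rng = random.Random(0)
n = 200
links = sample_links(n, 1, rng)
print(greedy_route(0, n - 1, links, n))
```

Because one immediate neighbor always lies one step closer to the target, the distance strictly decreases at every hop and the loop terminates after at most `abs(src - dst)` hops.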

4.3.1 Single Long-Distance Link 

We first analyze the delivery time in an idealized model with no failures and with one long-distance link per node. Kleinberg [2] proved that with $n^d$ nodes embedded at grid points in a $d$-dimensional grid, with each node $u$ connected to its immediate neighbors and one long-distance neighbor $v$ chosen with probability proportional to $1/d(u,v)^d$, any message can be delivered in time polynomial in $\log n$ using greedy routing. While this result can be directly applied to our model with $d = 1$ and $\ell = 1$ to give an $O(\log^2 n)$ delivery time, we get a much simpler proof by use of Lemma 1. We include the proof below for completeness.

Theorem 12 Let each node be connected to its immediate neighbors (at distance 1) and 1 long-distance neighbor chosen with probability inversely proportional to its distance from the node. Then the expected delivery time with $n$ nodes in the network is $T(n) = O(H_n^2)$.

Figure 2: All the possible distances that can be covered from source node s.



Proof: Let $\mu_k$ be the expected number of nodes crossed when the message is at a node that is at a distance $k$ from the destination. Clearly, $\mu_k$ is non-decreasing.
Let 



/^A; = 1 1 1- 



n2+k 1 1 
=2k i ' ^ 



where 



Then 



S S 



S S 

ni~k n2+k 



I ^ — ' Z 

i=l i=l 



1 k k 

f^k> ^[k + + ^^ni-fc + Hn2+k - H2k] > e > ^ 



Clearly, $\mu_k$ is non-decreasing, and thus using Lemma 1 we get $T(n) \le \sum_{k=1}^{n} 1/\mu_k = O(H_n^2)$.

Thus with this distribution, the delivery time is polylogarithmic in the number of nodes. ∎



4.3.2 Multiple Long-Distance Links 

The next interesting question is whether we can improve the $O(\log^2 n)$ delivery time by using multiple links instead of a single one. In addition to improving performance, multiple links also give the benefit of robustness in the face of failures. We first look at the improvement in performance from using multiple links and then go on to the analysis of failures in Section 4.3.3. Suppose that there are $\ell$ links from each node. We consider different strategies for generating links and routing depending on the number of links $\ell$ in two ranges: $\ell \in [1, \lg n]$ and $\ell \in (\lg n, n^c]$.

In [S], Kleinberg uses a group structure to get a delivery time of $O(\log n)$ for the case of a polylogarithmic number of links. However, he uses a more complicated algorithm for routing while we obtain the same bound (for the case of a line) using only greedy routing.

4.3.2.1 Upper Bound Let us first consider a randomized strategy for link distribution when $\ell \in [1, \lg n]$.

Theorem 13 Let each node be connected to its immediate neighbors (at distance 1) and $\ell$ long-distance neighbors chosen independently with replacement with probability inversely proportional to their distances from the node. Let $\ell \in [1, \lg n]$. Then the expected delivery time $T(n) = O(\log^2 n/\ell)$.

Figure 3: Multiple long-distance links for each node.



Proof: The basic idea for this proof comes from Kleinberg's model 0. Kleinberg considers a 
two-dimensional grid with nodes at every grid point. The delivery of the message is divided into 
phases. A message is said to be in phase j if the distance from the current node to the destination 
node is between $2^j$ and $2^{j+1}$. There are at most $(\lg n + 1)$ such phases. He proves that the expected time spent in each phase is at most $O(\log n)$, thus giving a total upper bound of $O(\log^2 n)$ on the delivery time. We use the same phase structure in our model, and this proof is along similar lines.

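In our multiple-link model, each node has $\ell$ long-distance neighbors chosen with replacement. The probability that $u$ chooses a node $v$ as its long-distance neighbor is $1 - (1 - q)^\ell$, where $q$ is the probability that a single long-distance link of $u$ goes to $v$. We get a lower bound on this probability as follows:

$$1 - (1 - q)^\ell \ge 1 - \left(1 - q\ell + \frac{\ell(\ell - 1)}{2}q^2\right) = q\ell\left(1 - \frac{(\ell - 1)q}{2}\right).$$
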

Notice that £q < 1, because q < and £ <lgn. So, the probability that u chooses v as its 
long-distance neighbor is at least 







$$q\ell\left(1 - \frac{1}{2}\right) = \frac{q\ell}{2} = \ell\left[2d(u,v)H_n\right]^{-1}.$$

Now suppose that the message is currently in phase $j$. To end phase $j$ at this step, the message should enter a set of nodes $B_j$ at a distance $\le 2^j$ of the destination node $t$. There are at least $2^j$ nodes in $B_j$, each within distance $2^{j+1} + 2^j < 2^{j+2}$ of $u$. So the message enters $B_j$ with probability at least $2^j \cdot \ell\left[2 \cdot 2^{j+2} H_n\right]^{-1} = \ell/(8H_n)$.

Let $X_j$ be the total number of steps spent in phase $j$. Then

$$\mathrm{E}X_j = \sum_{i=1}^{\infty} \Pr[X_j \ge i] \le \sum_{i=1}^{\infty} \left(1 - \frac{\ell}{8H_n}\right)^{i-1} = \frac{8H_n}{\ell}.$$

Now if $X$ denotes the total number of steps, then $X = \sum_j X_j$, and by linearity of expectation, we get $\mathrm{E}X \le (1 + \lg n)(8H_n/\ell) = O(\log^2 n/\ell)$. ∎
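
To get a feel for the constants, the bound $(1 + \lg n)(8H_n/\ell)$ from the proof can be evaluated directly. The Python sketch below is only an illustration; the function names `harmonic`, `phase_of`, and `delivery_bound` are hypothetical, and the bound is an upper estimate on expected hops rather than a prediction of actual delivery time.

```python
import math

def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def phase_of(distance):
    """Phase j of a message at the given distance (>= 1) from the target,
    i.e. the j with 2**j <= distance < 2**(j + 1)."""
    return distance.bit_length() - 1

def delivery_bound(n, num_links):
    """Upper bound (1 + lg n) * 8 * H_n / l on the expected number of hops."""
    return (1 + math.log2(n)) * 8 * harmonic(n) / num_links

for links in (1, 2, 5, 10):
    print(links, round(delivery_bound(1024, links), 1))
```

Doubling the number of links halves the bound, which is the $1/\ell$ dependence stated in Theorem 13.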

For $\ell \in (\lg n, n^c]$, we use a deterministic strategy. We represent the location of each node as a number in a base $b \ge 2$, and generate links to nodes at distances $1x, 2x, 3x, \ldots, (b - 1)x$, for each $x \in \{b^0, b^1, \ldots, b^{\lceil \log_b n \rceil - 1}\}$. Routing is done by eliminating the most significant digit of the distance at each step. As this distance can be at most $b^{\lceil \log_b n \rceil}$, we get $T(n) = O(\log_b n)$. This strategy is similar in spirit to Plaxton's algorithm [10].

Some special cases are instructive. Let £ = O(logn) and let each node link to nodes in both 
directions at distances 2*,1 < i < 2^°^""^, provided nodes are present at those distances. This 
gives $T(n) = O(\log n)$. Similarly let $\ell = O(\sqrt{n})$. Links are established in both directions to existing nodes at distances $1, 2, 3, \ldots, \sqrt{n}, 2\sqrt{n}, 3\sqrt{n}, \ldots, \sqrt{n}(\sqrt{n} - 1)$, giving $T(n) = O(1)$. In fact, $T(n) = O(1)$ when $b = n^c$, for any fixed $c$.

Theorem 14 Choose an integer $b > 1$. With $\ell = (b - 1)\lceil \log_b n \rceil$, let each node link to nodes at distances $1x, 2x, 3x, \ldots, (b - 1)x$, for each $x \in \{b^0, b^1, \ldots, b^{\lceil \log_b n \rceil - 1}\}$. Then the delivery time $T(n) = O(\log_b n)$.

Proof: Let $d_1, d_2, \ldots, d_t$ be the distances of the successive nodes in the delivery path from the target $t$, where $d_1$ is the distance of the source node and $d_t = 0$. For each $d_i > 0$, there exists $k_i \in \{0, 1, \ldots, \lfloor \log_b n \rfloor\}$ such that

$$b^{k_i} \le d_i < b^{k_i + 1}.$$

Hence

$$1 \le \left\lfloor \frac{d_i}{b^{k_i}} \right\rfloor \le b - 1.$$

Now each node is connected to the node at distance $b^{k_i}\lfloor d_i/b^{k_i} \rfloor$. We get

$$d_{i+1} = d_i - b^{k_i}\left\lfloor \frac{d_i}{b^{k_i}} \right\rfloor = d_i \bmod b^{k_i} < b^{k_i}.$$

Thus $k_i$ drops by at least 1 at every step. As $k_i \le \lfloor \log_b n \rfloor$, we get $T(n) = O(\log_b n)$. ∎
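
The digit-elimination argument in this proof can be traced in a few lines of code. The Python sketch below is illustrative; the function name `route_hops` is hypothetical, and the sketch tracks only the remaining distance to the target rather than node identities.

```python
def route_hops(distance, b):
    """Count hops when each hop follows the link of length m * b**k that
    removes the most significant base-b digit of the remaining distance."""
    hops = 0
    while distance > 0:
        power = 1
        while power * b <= distance:
            power *= b                      # power = b**k, b**k <= distance < b**(k+1)
        m = distance // power               # leading digit, 1 <= m <= b - 1
        distance -= m * power               # same as distance % power
        hops += 1
    return hops

print(route_hops(999, 10))   # three nonzero decimal digits
print(route_hops(255, 2))    # eight one-bits
```

The hop count equals the number of nonzero base-$b$ digits of the starting distance, so it never exceeds $\lfloor \log_b n \rfloor + 1$ for distances up to $n$.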

4.3.3 Failure of Links 

It appears that our linking strategies may fail to give the same delivery time in case the links fail. 
However, we show that we get reasonable performance even with link failures. In our model, we 
assume that each link is present independently with probability $p$. Let us first look at the randomized strategy for number of links $\ell \in [1, \lg n]$.

Figure 4: Each long-distance link is present with probability p.

Our proof is along similar lines as our proof for the case of no failures. Intuitively, since some 
of the links fail, we expect to spend more time in each phase and this time should be inversely 
proportional to the probability with which the links are present. We prove that the expected time 
spent in one phase is $O(\log n/(p\ell))$, which gives a total delivery time of $O(\log^2 n/(p\ell))$. We assume that the links to the immediate neighbors are always present so that a message is always delivered even if it takes very long.

Theorem 15 Let the model be as in Theorem 13. Assume that the links to the immediate neighbors are always present. If the probability of a long-distance link being present is $p$, then the expected delivery time is $O(\log^2 n/(p\ell))$.

Proof: Recall that in case of no link failures, the probability that $u$ chooses a node $v$ as its long-distance neighbor is at least $q\ell/2$, where $q$ is the probability that a single long-distance link of $u$ goes to $v$.

Now when we consider link failures, given that $u$ chose $v$ as its long-distance neighbor, the probability that there is a link present between $u$ and $v$ is $p$. So, the probability that $u$ has a working long-distance link to $v$ is at least $pq\ell/2 = p\ell[2d(u,v)H_n]^{-1}$.

The rest of the proof is the same as the proof for Theorem 13. Let $X_j$ be the number of steps spent in phase $j$. Then

$$\mathrm{E}X_j \le \sum_{i=1}^{\infty} \left(1 - \frac{p\ell}{8H_n}\right)^{i-1} = \frac{8H_n}{p\ell}.$$

If $X$ denotes the total number of steps, then by linearity of expectation, we get $\mathrm{E}X \le (1 + \lg n)(8H_n/(p\ell)) = O(\log^2 n/(p\ell))$. ∎

We turn to the deterministic strategy with $\ell \in (\lg n, n^c]$ links, where a similar intuition works. If a link fails, then the node has to take a shorter long-distance link, which will not take the message as close to the target as the initial failed link. Clearly as $p$ decreases, the message has to take shorter and shorter links, which increases the delivery time.

To make the analysis simpler, we change the link model a bit and let each node be connected to other nodes at distances $b^0, b^1, \ldots, b^{\lfloor \log_b n \rfloor}$. Once again, we compute the expected distance covered from the current node and use Lemma 1 to get a delivery time of $O(b \log n/p)$. As $p$ decreases, the delivery time increases; whereas as $b$ decreases, the delivery time decreases, but the information stored at each node increases.

Theorem 16 Let the number of links be $O(\log_b n)$, and let each node have a link to distances $b^0, b^1, b^2, \ldots, b^{\lfloor \log_b n \rfloor}$. Assume that the links to the nearest neighbors are always present. If the probability of a link being present is $p$, then the delivery time $T(n) = O(bH_n/p)$.

Proof: Let the distance of the current node from the destination be $k$. Let $\mu_k$ represent the distance covered starting from this node. Then with probability $p$, there will be a link covering distance $b^{\lfloor \log_b k \rfloor}$; if this link is absent with probability $q = 1 - p$, then we can cover a distance $b^{\lfloor \log_b k \rfloor - 1}$ with a single link with probability $pq$ and so on. In general, the average distance $\mu_k$
covered when the message is at distance $k$ from the destination is

$$\mu_k \ge p\left[b^{\lfloor \log_b k \rfloor} + q\,b^{\lfloor \log_b k \rfloor - 1} + \cdots + q^{\lfloor \log_b k \rfloor - 1}b^1 + q^{\lfloor \log_b k \rfloor}b^0\right] \ge \frac{p(k - 1)}{2(b - q)}.$$

Using Lemma 1 we get

$$T(n) \le \sum_{k=1}^{n} \frac{1}{\mu_k} \le 1 + \sum_{k=2}^{n} \frac{2(b - q)}{p(k - 1)} = 1 + \frac{2(b - q)}{p}\sum_{k=2}^{n} \frac{1}{k - 1} = O(bH_n/p).$$




4.3.4 Failure of Nodes 

We consider two different cases of node failures when we study their effect on system performance. 
In the first case, as described in Section 4.3.4.1, some of the nodes may fail and then the remaining nodes will link to each other as per the link distribution. In the second case, as explained in Section 4.3.4.2, the nodes first link to their neighbors and then some of the nodes may fail.

4.3.4.1 Binomially Distributed Nodes Let p be the probability that a node is present at any 
point. Here also, each node is connected to its nearest neighbors and one long-distance neighbor. 
In addition, the probability of choosing a particular node as a long-distance neighbor is conditioned 
on the existence of that node. 

Theorem 17 Let the model be as in Theorem 12. Let each node be present with probability $p$ and all nodes link only to existing nodes. Then the worst-case expected delivery time is $O(\log^2 n)$.

Proof: We bound the expected drop $\mu_k$ as follows:

Using Lemma 1 we get $T(n) \le \sum_{k=1}^{n} 1/\mu_k = O(H_n^2)$. This is exactly the same result that we get in Section 4.3.1 where all the nodes are present. ∎

This result is not surprising because if nodes link only to other existing nodes, the only difference 
is that we get a smaller random graph. This does not affect the routing algorithm or the delivery 
time. 

4.3.4.2 General Failures We observe that the analysis for node failures is not as simple as 
that for link failures because we no longer have the important property of independence that we 
have in the latter case. In the case of link failures, the nodes first choose their neighbors and then 
it is possible that some of these links fail; thus, the event that a node is connected to another node 
is completely independent of the event that, say, its neighbor is connected to the same node. Each 
link fails independently, and so the accessibility of a target node from any other node depends only 
on the presence of the link between the two nodes in question. 

In case of node failures, this important independence property is no longer true. Suppose that 
a node u cannot communicate with some other node v (because v failed), even though there may 
be a functional link between u and v. Now the probability of some other node w being able to 
communicate with v is not independent of the probability that u can communicate with v because 
the probability of v being absent is common for both the cases. This complicates the analysis of 
the performance because it is no longer the case that if one node cannot communicate with some 
other node, it has a good chance of doing so by passing the message to its neighbor. 

In order to analyze this situation, we consider jumps only to one phase lower rather than 
jumping over several phases. The idea is that the jumps between phases are independent, so once 
we move from phase j to phase j — 1, further routing no longer depends on any nodes in phase 
j. We can condition on the number of nodes being alive in the lower phase and estimate the time 
spent in each phase. Intuitively, if a node is present with probability p, we would expect to wait 
for a time inversely proportional to p in anticipation of finding a node in the lower phase to jump 
to. 

Theorem 18 Let the model be as in Theorem 13 and let each node fail with probability $p$. Then the expected delivery time is $O(\log^2 n/((1 - p)\ell))$.

Proof: Let $T$ be the time taken to drop down from layer $j$ to layer $j - 1$. Let $l$ out of $N$ nodes be alive in layer $j - 1$ and let $q$ be the probability that a node in layer $j$ is connected to some node in layer $j - 1$. Then the expected time to drop to layer $j - 1$, given that there are $l$ live nodes in it, is given by

$$\mathrm{E}[T \mid l] = \frac{N}{ql}.$$

Now $l$ can vary between 1 and $N$. (Note that $l$ cannot be 0 because if there are no live nodes in the lower layer, the routing fails at this point.) We get

$$\begin{aligned}
\mathrm{E}[T] &\le \sum_{l=1}^{N} \frac{N}{ql}\binom{N}{l}(1 - p)^l p^{N - l} \\
&\le \sum_{l=1}^{N} \frac{2N}{q(l + 1)}\binom{N}{l}(1 - p)^l p^{N - l} \\
&= \frac{2N}{q(N + 1)(1 - p)}\sum_{l=1}^{N} \binom{N + 1}{l + 1}(1 - p)^{l + 1} p^{N - l} \\
&\le \frac{2N}{q(N + 1)(1 - p)}\left[p + (1 - p)\right]^{N + 1} \\
&= \frac{2N}{q(N + 1)(1 - p)}.
\end{aligned}$$




Not surprisingly, the expected waiting time in a layer is inversely proportional to the probability 
of being connected to a node in the lower layer and to the probability of such a node being alive. 

For our randomized routing strategy with $\ell \in [1, \lg n]$ links, $q$ is on the order of $\ell/H_n$. Since there are at most $(\lg n + 1)$ layers, we get an expected delivery time of $O(\log^2 n/((1 - p)\ell))$. ∎

In contrast, for our deterministic routing strategy, certain carefully chosen node failures can 
lead to dismal situations where a message can get stuck in a local neighborhood with no hope of 
getting out of it or eventually reaching the destination node. We conjecture that this should be a 
very low probability event, so its occurrence will not affect the delivery time considerably. We have 
not yet analyzed this situation formally. 



5 Construction of Graphs 

As the group of nodes present in the network changes, so does the graph of the virtual overlay network. In order for our routing techniques to be effective, the graph must always exhibit the property that the likelihood of any two vertices $v, u$ being connected is $\Omega(1/d(v,u))$. We describe a heuristic approach to construct and maintain a random graph with such an invariant.

Since the choice of links leaving each vertex is independent of the choices of other vertices, we can assume that points in the metric space are added one at a time. Let $v$ be the $k$-th point to be added. Point $v$ chooses the sinks of its outgoing links according to the inverse power-law distribution with exponent 1 and connects to them by running the search algorithm. If a desired sink $u$ is not present, $v$ connects to $u$'s closest live neighbor. In effect, each of the $k - 1$ points already present before $v$ is surrounded by a basin of attraction, collecting probability mass in proportion to its length. Since we assume the hash function populates the metric space evenly, and because of absolute symmetry, the basin length $L$ has the same distribution for all points. It is easy to see that with high probability,
L will not be much smaller than its expectation: Prob[L < c • k~^] = 1 — (1 — c • k^^)^~^. A lower 
bound on the probability that the link {v,u) is present is c' • ■ d{v,f)~^, where / is the point 
in u's basin that is the farthest from v.^ However, the bound holds only if u is among the k — 1 
points added before v. Otherwise, the aforementioned probability is 0, which means that we need 
to amend our linking strategy to transfer probability mass from the case of $u$ having arrived before $v$ to the case of $u$ having arrived after $v$. We describe next how to accomplish this task.

Let $v$ be a new point. We give earlier points the opportunity to obtain outgoing links to $v$ by having $v$ (1) calculate the number of incoming links it "should" have from points added before it arrived, and (2) choose such points according to the inverse power-law distribution with exponent 1.^ If $\ell$ is the number of outgoing links for each point, then $\ell$ will also be the expected number of incoming links that $v$ has to estimate in step (1). We approximate the number of links ending at $v$ by using a Poisson distribution with rate $\ell$, that is, the probability that $v$ has $k$ incoming links is $e^{-\ell}\ell^k/k!$, and the expectation of the distribution is $\ell$.
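
Step (1) amounts to drawing a Poisson-distributed count. The Python sketch below shows one way a new node could do this; it is an assumption-laden illustration, not the procedure used in this work. The names `poisson_pmf` and `sample_incoming_links` are hypothetical, and the sampler uses the standard product-of-uniforms method, which is practical for moderate rates such as the number of links per node.

```python
import math
import random

def poisson_pmf(k, rate):
    """Probability e^{-rate} * rate^k / k! of exactly k incoming links."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def sample_incoming_links(rate, rng):
    """Draw a Poisson(rate) count by multiplying uniform variates until the
    running product falls to exp(-rate) or below."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(42)
samples = [sample_incoming_links(14.0, rng) for _ in range(10000)]
print(sum(samples) / len(samples))  # should be close to the rate, 14
```

The sample mean approaches the rate $\ell$, matching the expectation stated above.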

After step (2) is completed by $v$, each chosen point $u$ responds to $v$'s request by choosing one of its existing links to be replaced by a link to $v$. The choice of the link to replace can vary. We use a strategy that builds on the work of Sarshar et al. [12]. In that work, the authors use ideas of Zhang et al. [15] to build a graph where each node has a single long-distance link to a node at distance $d$ with probability $1/d$. When a node with a long-distance link at distance $d_1$ encounters a new node at distance $d_2$, either due to its arrival or due to a data request, it replaces its existing link with probability $p_2/(p_1 + p_2)$, where $p_i = 1/d_i$, and links to the new node. We extend this idea to our case of multiple long-distance links. Consider a node $u$ with $k$ neighbors at distances $d_1, d_2, \ldots, d_k$. When a new node $v$ at distance $d_{k+1}$ requests an incoming link from $u$, $u$ replaces one of its existing links with a link to $v$ with probability $p_{k+1}/\sum_{j=1}^{k+1} p_j$. This is a trivial extension of the formula $p_2/(p_1 + p_2)$ of [12]. However, this probability must now be distributed among $u$'s $k$ existing long-distance links since $u$ needs to choose one of them to redirect to $v$. We choose to do that according to the inverse power-law distribution with exponent 1, that is, $u$ chooses to replace its link to the node at distance $d_i$, $1 \le i \le k$, with probability $p_i/\sum_{j=1}^{k} p_j$. Hence, the probability that $u$ decides to link to $v$ and decides to replace its existing link to the node at distance $d_i$ with a link to $v$ is equal to $(p_i/\sum_{j=1}^{k} p_j) \cdot (p_{k+1}/\sum_{j=1}^{k+1} p_j)$. Notice that $u$ may decide not to redirect any of its existing links to $v$ with probability $1 - p_{k+1}/\sum_{j=1}^{k+1} p_j$. The intuition for using such a replacement strategy comes from the invariant that we want to maintain dynamically as new nodes arrive: $u$ has a link to a node $i$ at distance $d_i$ with probability inversely proportional to $d_i$; hence, conditioning

on u having k long-distance links, the following equation must hold.

\[
\begin{aligned}
\mathrm{Prob}[u \text{ replaces link to } i \text{ with link to } v]
  &= \mathrm{Prob}[u \text{ has a link to } i \text{ before } v \text{ arrives}] \\
  &\quad - \mathrm{Prob}[u \text{ has a link to } i \text{ after } v \text{ arrives}] \\
  &= \frac{p_i}{\sum_{j=1}^{k} p_j} - \frac{p_i}{\sum_{j=1}^{k+1} p_j} \\
  &= \frac{p_i}{\sum_{j=1}^{k} p_j} \cdot \frac{p_{k+1}}{\sum_{j=1}^{k+1} p_j}
\end{aligned}
\]

*The constant c' has absorbed c and the normalizing constant for the distribution.
^All this can be easily calculated by v since the link probabilities are symmetric.

The same heuristic can be used for regeneration of links when a node crashes. 
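
The replacement rule can be checked with exact arithmetic. The sketch below assumes integer distances and uses hypothetical function names; it computes the per-link replacement probabilities and draws a decision for u when a new node at distance $d_{k+1}$ requests a link.

```python
from fractions import Fraction
import random

def replacement_probabilities(distances, new_distance):
    """Return (per-link replacement probabilities, probability of no change).

    distances: positive integers d_1..d_k of u's existing long-distance links.
    new_distance: positive integer d_{k+1} of the requesting node v.
    """
    p = [Fraction(1, d) for d in distances]
    p_new = Fraction(1, new_distance)
    s_k = sum(p)                      # sum over j = 1..k
    s_k1 = s_k + p_new                # sum over j = 1..k+1
    accept = p_new / s_k1             # u agrees to link to v at all
    per_link = [(p_i / s_k) * accept for p_i in p]
    return per_link, 1 - accept

def maybe_replace(distances, new_distance, rng):
    """Return the index of the link u redirects to v, or None if u keeps all links."""
    per_link, _ = replacement_probabilities(distances, new_distance)
    r, cumulative = rng.random(), 0.0
    for i, prob in enumerate(per_link):
        cumulative += float(prob)
        if r < cumulative:
            return i
    return None

if __name__ == "__main__":
    per_link, keep = replacement_probabilities([1, 2, 4], 8)
    print(per_link, keep)
    print(maybe_replace([1, 2, 4], 8, random.Random(0)))
```

For distances 1, 2, 4 and a new node at distance 8, the first link is replaced with probability 4/105, which equals 4/7 − 8/15, the difference between the "before" and "after" probabilities in the invariant.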

To analyze the performance of the heuristic in practice, we used it to construct a network of
$2^{14}$ nodes with 14 links each, ten separate times. After averaging the results over the ten networks,
we plotted the distribution of long-distance links derived from the heuristic, along with the ideal
inverse power-law distribution with exponent 1, as shown in Figure 5(a). We see that the derived
distribution tracks the ideal one very closely, with the largest absolute error being roughly equal
to 0.022 for links of length 2, as shown in the graph of Figure 5(b).
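
A comparison of this kind could be computed along the following lines. The sketch assumes the ideal distribution assigns a link of length d a probability proportional to 1/d over lengths 1 to a chosen maximum; the normalization actually used for the figure and all function names here are assumptions.

```python
from collections import Counter

def ideal_distribution(max_length):
    """Inverse power-law with exponent 1 over link lengths 1..max_length."""
    norm = sum(1.0 / d for d in range(1, max_length + 1))
    return {d: (1.0 / d) / norm for d in range(1, max_length + 1)}

def empirical_distribution(link_lengths):
    """Fraction of observed links having each length."""
    counts = Counter(link_lengths)
    total = len(link_lengths)
    return {d: c / total for d, c in counts.items()}

def largest_absolute_error(derived, ideal):
    """Return (length, error) where the two distributions differ the most."""
    lengths = set(derived) | set(ideal)
    def error(d):
        return abs(derived.get(d, 0.0) - ideal.get(d, 0.0))
    worst = max(lengths, key=error)
    return worst, error(worst)

if __name__ == "__main__":
    ideal = ideal_distribution(4)
    derived = empirical_distribution([1, 1, 2, 4])  # toy data, not the paper's
    print(largest_absolute_error(derived, ideal))
```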

We also performed experiments for an alternative link replacement strategy: a node chooses its 
oldest link to replace with a link to the new node. The performance of this strategy is almost as 
good as the performance of our replacement strategy described previously. We omit those results 
because it is difficult to distinguish between the results of the two strategies on the scale used for 
our graphs. 

There has also been other related work [2] on how to construct, with the support of a central
server, random graphs with many desirable properties, such as small diameter and guaranteed 
connectivity with high probability. Although it is not clear what kind of fault-tolerance properties 
this approach offers if the central server crashes, or how the constructed graph can be used for 
efficient routing, it is likely that similar techniques could be useful in our setting. 



6 Experimental Results 

We simulated a network of $n = 2^{17}$ nodes at the application level. Each node is connected to its
immediate neighbors and has $\lg n = 17$ long-distance links chosen as per the inverse power-law
distribution with exponent 1 as explained in Section 4.3. Routing is done greedily by forwarding a
message to the neighbor closest to its target node. In each simulation, the network is set up afresh, 
and a fraction p of the nodes fail. We then repeatedly choose random source and destination nodes 
that have not failed and route a message between them. For each value of p, we ran 1000 simulations, 
delivering 100 messages in each simulation, and averaged the number of hops for successful searches 
and the number of failed searches. 
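
The failure-free core of such a simulation can be sketched as below. This is an illustration under assumptions rather than the simulator used for the experiments: nodes are the integers 0 to n−1 on a line, long-distance links are drawn with replacement (so duplicates collapse), and the quadratic-time construction is only practical for small n.

```python
import random

def build_network(n, num_long_links, rng):
    """Give each node its immediate neighbors plus long links drawn with Pr ~ 1/d."""
    links = {}
    for u in range(n):
        neighbors = {w for w in (u - 1, u + 1) if 0 <= w < n}
        others = [w for w in range(n) if w != u]
        weights = [1.0 / abs(w - u) for w in others]
        neighbors.update(rng.choices(others, weights=weights, k=num_long_links))
        links[u] = sorted(neighbors)
    return links

def greedy_route(links, source, target):
    """Forward to the neighbor closest to the target; return the hop count."""
    current, hops = source, 0
    while current != target:
        # An immediate neighbor is always strictly closer, so this terminates.
        current = min(links[current], key=lambda w: abs(w - target))
        hops += 1
    return hops

if __name__ == "__main__":
    rng = random.Random(1)
    net = build_network(256, 8, rng)
    print(greedy_route(net, 0, 255))
```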

With node failures, a node may not be able to find a live neighbor that is closer to the target 
node than itself. We studied three possible strategies to overcome this problem as follows. 

1. Terminate the search. 

2. Randomly choose another node, deliver the message to this new node and then try to deliver 
the message from this node to the original destination node (similar to the hypercube routing 
strategy explained in [H]). 

Figure 5: (a) The distribution of long-distance links produced by the inverse-distance heuristic
(DERIVED) compared to the ideal inverse power-law distribution with exponent 1 (IDEAL). (b)
The absolute error between the derived distribution and the ideal inverse power-law distribution
with exponent 1.

3. Keep track of a fixed number (in our simulations, 5) of nodes through which the message is 
last routed and backtrack. When the search reaches a node from where it cannot proceed, it 
backtracks to the most recently visited node from this list and chooses the next best neighbor 
to route the message to. 

For all these strategies we note that once a node chooses its best neighbor, it does not send the 
message to any other link if it finds out that the best neighbor has failed. 
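
Strategies 1 and 3 can be sketched with one routine. Several details below are assumptions not fixed by the text: a node picks its best untried neighbor that is closer to the target, a failed best neighbor counts as a dead end, a backtracking step counts as one hop, and the search fails once the bounded history is empty. The name `route` and the `links` layout (node to list of neighbors) are hypothetical.

```python
from collections import deque

def route(links, failed, source, target, history_size=0):
    """Greedy routing with failed nodes.

    history_size=0 terminates at the first dead end (strategy 1); a positive
    value keeps that many recently visited nodes for backtracking (strategy 3).
    Returns the hop count on success, or None if the search fails.
    """
    history = deque(maxlen=history_size)
    tried = {}                       # node -> neighbors already attempted from it
    current, hops = source, 0
    while current != target:
        seen = tried.setdefault(current, set())
        closer = [w for w in links[current]
                  if w not in seen and abs(w - target) < abs(current - target)]
        best = min(closer, key=lambda w: abs(w - target)) if closer else None
        if best is not None:
            seen.add(best)
        if best is not None and best not in failed:
            history.append(current)  # remember where we came from
            current = best
        elif history:
            current = history.pop()  # dead end: go back to the most recent node
        else:
            return None
        hops += 1
    return hops

if __name__ == "__main__":
    toy = {0: [1, 3], 1: [0, 2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
    print(route(toy, {4}, 0, 5), route(toy, {4}, 0, 5, history_size=5))
```

In the toy network, failing node 4 strands the plain greedy search at node 3, while the backtracking variant returns to node 0 and reaches the target through node 1.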

Figure 6 shows the fraction of messages that fail to be delivered and the number of hops for
successful searches versus the fraction of failed nodes. We see that the system behaves well even 
with a large number of failed nodes. In addition, backtracking gives a significant improvement in 
reducing the number of failures as compared to the other two methods, although it may take a 
longer time for delivery. We see that in the case of random rerouting, the average delivery time 
does not increase too much as the probability of node failure increases. This happens because quite 
a few of the searches fail, so the ones that succeed (with a few hops) lead to a small average delivery 
time. 

Our results may not be directly comparable to those of CAN[TT] and Chord [T^. since they use 
different simulators for their experiments. However, to the extent that the results are comparable, 
our methods appear to perform as well as theirs. Even if we just terminate the search, we get fewer
than a p fraction of failed searches with a p fraction of failed nodes. Chord has roughly the same
performance after its network stabilizes using its repair mechanism. Further, with backtracking,
even with 80% failed nodes, fewer than 30% of searches fail. These results are very
promising, and it would be interesting to study backtracking analytically.

We also compared the performance of the ideal network and that of the network constructed
using the heuristics given in Section 5. We ran 10 iterations of constructing a network of 16384
nodes, both ideally as well as according to the heuristic, and delivered 1000 messages between
randomly chosen nodes. Figure 7 shows the number of failed searches as the probability of node
failure increases. We see that although the network constructed using the heuristic does not perform
as well as the ideal network, the number of failed searches is comparable.

Figure 6: (a) The fraction of messages that fail to be delivered as a function of the fraction of failed
nodes. (b) The average delivery time for successful searches as a function of the fraction of failed
nodes.




Figure 7: Fraction of failed searches. 
[Table 1 body lost in extraction. Surviving entries: "[1, lg n]"; "Pr[Node present] = p"; "[1, lg n]".]

Table 1: Summary of upper and lower bounds for routing.

7 Conclusions and Future Work 

Table 1 summarizes our upper and lower bounds. We have shown that greedy routing in an overlay
network organized as a random graph in a metric space can be a nearly optimal mechanism for 
searching in a peer-to-peer system, even in the presence of many faults. We see this as an important 
first step in the design of efficient algorithms for such networks, but many issues still need to be 
addressed. Our results mostly apply to one-dimensional metric spaces like the line or a circle. 
One interesting possibility is whether similar strategies would work for higher-dimensional spaces, 
particularly ones in which some of the dimensions represent the actual physical distribution of 
the nodes in real space; good network-building and search mechanisms for this model might allow 
efficient location of nearby instances of a resource without having to resort to local flooding (as 
in Another promising direction would be to study the security properties of greedy routing 
schemes to see how they can be adapted to provide desirable properties like anonymity or robustness 
against Byzantine failures. 